Abstract

We develop minimax optimal risk bounds for the general learning 
task consisting in predicting as well as the best function in a reference 
set Q up to the smallest possible additive term, called the convergence 
rate. When the reference set is finite and when n denotes the size of
the training data, we provide minimax convergence rates of the form
$C \left( \frac{\log |Q|}{n} \right)^v$ with tight evaluation of the
positive constant C and with exact $0 < v \le 1$, the latter value
depending on the convexity of the loss function and on the level of
noise in the output distribution.

The risk upper bounds are based on a sequential randomized algorithm,
which at each step concentrates on functions having both low risk and
low variance with respect to the previous step prediction function.
Our analysis puts forward the links between the probabilistic and
worst-case viewpoints, and makes it possible to obtain risk bounds
unachievable with the standard statistical learning approach. One of
the key ideas of this work is to use probabilistic inequalities with
respect to appropriate (Gibbs) distributions on the prediction function
space instead of using them with respect to the distribution generating
the data.

The risk lower bounds are based on refinements of the Assouad lemma
taking particularly into account the properties of the loss function.
Our key example to illustrate the upper and lower bounds is the
$L_q$-regression setting, for which an exhaustive analysis of the
convergence rates is given while q ranges in [1; +∞[.



1 Introduction 

We are given a family Q of functions and we want to learn from data 
a function that predicts as well as the best function in Q up to some 
additive term called the convergence rate. Even when the set Q is 
finite, this learning task is crucial since 






• any continuous set of prediction functions can be viewed through
its covering nets with respect to (w.r.t.) appropriate
(pseudo-)distances and these nets are generally finite.

• one way of doing model selection among a finite family of sub- 
models is to cut the training set into two parts, use the first part 
to learn the best prediction function of each submodel and use 
the second part to learn a prediction function which performs as 
well as the best of the prediction functions learned on the first 
part of the training set. 

From this last item, our learning task for finite Q is often referred 
to as model selection aggregation. It has two well-known variants. 
Instead of looking for a function predicting as well as the best in Q, 
these variants want to perform as well as the best convex combination 
of functions in Q or as well as the best linear combination of functions 
in Q. These three aggregation tasks are linked in several ways (see [50] 
and references within). 

Nevertheless, among these learning tasks, model selection aggrega- 
tion has rare properties. First, in general an algorithm picking func- 
tions in the set Q is not optimal (see e.g. [8, Theorem 2], [44, Theorem 
3], [23, p.14]). 

This means that the estimator has to look at an enlarged set of 
prediction functions. Secondly, in the statistical community, the only
known optimal algorithms are all based on a Cesàro mean of Bayesian
estimators (also referred to as progressive mixture rule). Thirdly, the
proof of their optimality is not achieved by the most prominent tool
in statistical learning theory: bounds on the supremum of empirical
processes (see [53], and refined works such as [13, 41, 46, 19] and
references within).

The idea of the proof, which goes back to Barron [11], is based on
a chain rule and has proved successful for least square and entropy
losses [22, 23, 12, 58, 21] and for general losses in [38].

In the online prediction with expert advice setting, without any
probabilistic assumption on the generation of the data, appropriate
weighting methods have been shown to behave as well as the best
expert up to a minimax-optimal additive remainder term (see [47, 29]
and references within). In this worst-case context, amazingly sharp
constants have been found (see in particular [37, 27, 28, 59]). These
results are expressed in cumulative loss and can be transposed to model
selection aggregation to the extent that the expected risk of the
randomized procedure based on sequential predictions is proportional to
the expectation of the cumulative loss of the sequential procedure (see
Lemma 4.3 for a precise statement).

This work presents a sequential algorithm, which iteratively updates
a prior distribution put on the set of prediction functions. Contrary
to previously mentioned works, these updates take into account
the variance of the task. As a consequence, posterior distributions
concentrate on simultaneously low risk functions and functions close
to the previously drawn prediction function. This conservative law is
not surprising in view of previous works on high dimensional
statistical tasks, such as wavelet thresholding, shrinkage procedures,
iterative compression schemes ([5]), iterative feature selection ([1]).

The paper is organized as follows. Section 2 introduces the notation
and the existing algorithms. Section 3 proposes a unifying setting to
combine tight worst-case analysis results and probabilistic tools. It
details our sequentially randomized estimator and gives a sharp expected
risk bound. In Sections 4 and 5, we show how to apply our main result
under assumptions coming respectively from sequential prediction and
model selection aggregation. While all this work concentrates on
stating results when the data are independent and identically
distributed, Section 4.2 collects new results for sequential
predictions, i.e. when no probabilistic assumption is made and when the
data points come one by one (i.e. not in a batch manner). Section 6
contains algorithms that satisfy sharp standard-style generalization
error bounds. To the author's knowledge, these bounds are not achievable
with the classical statistical learning approach based on the supremum
of empirical processes. Here the main trick is to use probabilistic
inequalities w.r.t. appropriate distributions on the prediction function
space instead of using them w.r.t. the distribution generating the data.
Section 7 presents an improved bound for $L_q$-regression (q > 1) when
the noise has just a bounded moment of order s > q. This last assumption
is much weaker than the traditional exponential moment assumption.
Section 8 refines Assouad's lemma in order to obtain sharp constants and
to take into account the properties of the loss function of the learning
task. We illustrate our results by providing lower bounds matching the
upper bounds obtained in the previous sections and by significantly
improving the constants in lower bounds concerning Vapnik-Cervonenkis
classes in classification. Section 9 summarizes the contributions of
this work and lists some related open problems.

2 Notation and existing algorithms 

We assume that we observe n input-output pairs
$Z_1 = (X_1, Y_1), \dots, Z_n = (X_n, Y_n)$ and that each pair has been
independently drawn from the same unknown distribution denoted P. The
input and output spaces are denoted respectively $\mathcal{X}$ and
$\mathcal{Y}$, so that P is a probability distribution on the product
space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. The target of a
learning algorithm is to predict the output Y associated with an input X
for pairs (X, Y) drawn from the distribution P. In this work, $Z_{n+1}$
will denote a random variable independent of the training set
$Z_1^n = (Z_1, \dots, Z_n)$ and with the same distribution P. The
quality of a prediction function $g : \mathcal{X} \to \mathcal{Y}$ is
measured by the risk (also called expected loss or regret):

\[ R(g) \triangleq \mathbb{E}_{Z \sim P} L(Z, g), \]

where L(Z, g) assesses the loss of considering the prediction function g
on the data $Z \in \mathcal{Z}$. The symbol $\triangleq$ is used to
underline that the equality is a definition. When there is no ambiguity
on the distribution that a random variable has, the expectation w.r.t.
this distribution will simply be written by indexing the expectation
sign $\mathbb{E}$ by the random variable. For instance, we can write
$R(g) \triangleq \mathbb{E}_Z L(Z, g)$. More generally, when there are
multiple sources of randomness, $\mathbb{E}_Z$ means that we take the
expectation with respect to the conditional distribution of Z knowing
all other sources of randomness.

We use L(Z, g) rather than L[Y, g(X)] to underline that our results
are not restricted to non-regularized losses, where we call
non-regularized loss a loss that can be written as $\ell[Y, g(X)]$ for
some function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$.

For any $i \in \{0, \dots, n\}$, the cumulative loss suffered by the
prediction function g on the first i input-output pairs, denoted
$Z_1^i$ for short, is

\[ \Sigma_i(g) \triangleq \sum_{j=1}^{i} L(Z_j, g), \]

where by convention we take $\Sigma_0$ identically equal to zero. The
symbol $\equiv$ is used to underline when a function is identical to a
constant (e.g. $\Sigma_0 \equiv 0$). With slight abuse, a symbol
denoting a constant function may be used to denote the value of this
function.
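As a minimal sketch of these two definitions, the snippet below computes the cumulative loss $\Sigma_i(g)$ on a toy sample. The square loss, the three data points and the two prediction functions are illustrative assumptions, not taken from the paper.

```python
def square_loss(z, g):
    """L(Z, g) for the square loss; z is an (x, y) pair (assumed example loss)."""
    x, y = z
    return (y - g(x)) ** 2


def cumulative_loss(data, g, i, loss=square_loss):
    """Sigma_i(g): sum of L(Z_j, g) over the first i pairs, with Sigma_0 = 0."""
    return sum(loss(z, g) for z in data[:i])


# Hypothetical toy sample and prediction functions.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
identity = lambda x: x
zero = lambda x: 0.0

sigma_3_identity = cumulative_loss(data, identity, 3)
sigma_3_zero = cumulative_loss(data, zero, 3)
```

The empirical risk on the first i pairs is simply `cumulative_loss(data, g, i) / i` for i > 0.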

We assume that the set, denoted $\bar{Q}$, of all prediction functions
has been equipped with a σ-algebra. Let $\mathcal{D}$ be the set of all
probability distributions on $\bar{Q}$. By definition, a randomized
algorithm produces a prediction function drawn according to a
probability in $\mathcal{D}$. Let $\mathcal{P}$ be a set of probability
distributions on $\mathcal{Z}$ which is assumed to contain the true
unknown distribution generating the data. The learning task is
essentially described by the 3-tuple $(Q, L, \mathcal{P})$ since we look
for a possibly randomized estimator (or algorithm) $\hat{g}$ such that

\[ \sup_{P \in \mathcal{P}} \Big\{ \mathbb{E}_{Z_1^n} R(\hat{g}_{Z_1^n}) - \min_{g \in Q} R(g) \Big\} \]

is minimized, where we recall that
$R(g) \triangleq \mathbb{E}_{Z \sim P} L(Z, g)$. To shorten notation,
when no confusion can arise, the dependence of $\hat{g}_{Z_1^n}$ w.r.t.
the training sample $Z_1^n$ will be dropped and we will simply write
$\hat{g}$. This means that we use the same symbol for both the algorithm
and the prediction function produced by the algorithm on a training
sample.

We implicitly assume that the quantities we manipulate are measurable:
in particular, we assume that a prediction function is a measurable
function from $\mathcal{X}$ to $\mathcal{Y}$, that the mapping
$(x, y, g) \mapsto L[(x, y), g]$ is measurable, that the estimators
considered in our lower bounds are measurable, etc.

The n-fold product of a distribution $\mu$, which is the distribution of
a vector consisting of n i.i.d. realizations of $\mu$, is denoted
$\mu^{\otimes n}$. For instance the distribution of $(Z_1, \dots, Z_n)$
is $P^{\otimes n}$.

The symbol C will denote some positive constant whose value may
differ from line to line. The set of non-negative real numbers is denoted
$\mathbb{R}_+ = [0; +\infty[$. We define $\lfloor x \rfloor$ as the
largest integer k such that $k \le x$. To shorten notation, any finite
sequence $a_1, \dots, a_n$ will occasionally be denoted $a_1^n$. For
instance, the training set is $Z_1^n$.

To handle a possibly continuous set Q, we consider that Q is a
measurable space and that we have some prior distribution $\pi$ on it.
The set of probability distributions on Q will be denoted
$\mathcal{M}$. The Kullback-Leibler divergence between a distribution
$\rho \in \mathcal{M}$ and the prior distribution $\pi$ is

\[ K(\rho, \pi) \triangleq \begin{cases} \mathbb{E}_{g \sim \rho} \log\big(\frac{\rho}{\pi}(g)\big) & \text{if } \rho \ll \pi, \\ +\infty & \text{otherwise,} \end{cases} \]

where $\frac{\rho}{\pi}$ denotes the density of $\rho$ w.r.t. $\pi$ when
it exists (i.e. $\rho \ll \pi$). For any $\rho \in \mathcal{M}$, we have
$K(\rho, \pi) \ge 0$ and when $\pi$ is the uniform distribution on a
finite set Q, we also have $K(\rho, \pi) \le \log |Q|$. The
Kullback-Leibler divergence satisfies the duality formula (see e.g. [24,
p. 10]): for any real-valued measurable function h defined on Q,

\[ \inf_{\rho \in \mathcal{M}} \big\{ \mathbb{E}_{g \sim \rho} h(g) + K(\rho, \pi) \big\} = -\log \mathbb{E}_{g \sim \pi} e^{-h(g)}, \tag{2.1} \]

and the infimum is reached for the Gibbs distribution

\[ \pi_{-h}(dg) \triangleq \frac{e^{-h(g)}}{\mathbb{E}_{g' \sim \pi} e^{-h(g')}} \cdot \pi(dg). \tag{2.2} \]

Intuitively, the Gibbs distribution $\pi_{-h}$ concentrates on
prediction functions g that are close to minimizing the function
$h : Q \to \mathbb{R}$.
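The duality formula (2.1) and the Gibbs distribution (2.2) can be checked numerically when Q is finite. The following is a small sketch under that assumption; the prior and the values of h are arbitrary illustrative numbers.

```python
import math


def gibbs(prior, h):
    """Gibbs distribution pi_{-h} on a finite set: weights prior[g] * exp(-h[g])."""
    weights = [p * math.exp(-hg) for p, hg in zip(prior, h)]
    total = sum(weights)
    return [w / total for w in weights]


def kl(rho, prior):
    """Kullback-Leibler divergence K(rho, prior) on a finite set (prior > 0)."""
    return sum(r * math.log(r / p) for r, p in zip(rho, prior) if r > 0)


def objective(rho, prior, h):
    """E_{g ~ rho} h(g) + K(rho, prior), the quantity minimized in (2.1)."""
    return sum(r * hg for r, hg in zip(rho, h)) + kl(rho, prior)


# Arbitrary illustrative prior and function h on a three-element set.
prior = [0.5, 0.3, 0.2]
h = [1.0, 0.2, 2.5]
rho_star = gibbs(prior, h)
dual_value = -math.log(sum(p * math.exp(-hg) for p, hg in zip(prior, h)))
```

Here `objective(rho_star, prior, h)` coincides with `dual_value` up to rounding, while any other distribution gives a value at least as large.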

For any $\rho \in \mathcal{M}$, the function
$\mathbb{E}_{g \sim \rho}\, g : x \mapsto \mathbb{E}_{g \sim \rho}\, g(x) = \int g(x) \rho(dg)$
is called a mixture of prediction functions. When Q is finite, a mixture
is simply a convex combination. Throughout this work, whenever we
consider mixtures of prediction functions, we implicitly assume that
$\mathbb{E}_{g \sim \rho}\, g(x)$ belongs to $\mathcal{Y}$ for any x so
that the mixture is a prediction function. This is typically the case
when $\mathcal{Y}$ is an interval of $\mathbb{R}$.

We will say that the loss function is convex when the function
$g \mapsto L(z, g)$ is convex for any $z \in \mathcal{Z}$, equivalently
$L(z, \mathbb{E}_{g \sim \rho}\, g) \le \mathbb{E}_{g \sim \rho} L(z, g)$
for any $\rho \in \mathcal{M}$ and $z \in \mathcal{Z}$. In this work, we
do not assume the loss function to be convex except when it is
explicitly mentioned.

The algorithm used to prove optimal convergence rates for several
different losses (see e.g. [22, 23, 12, 18, 58, 21, 38]) is the following:

Algorithm A: Let $\lambda > 0$. Predict according to
$\frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{g \sim \pi_{-\lambda \Sigma_i}}\, g$,
where we recall that $\Sigma_i$ maps a function $g \in Q$ to its
cumulative loss up to time i.

In other words, for a new input x, the prediction of the output given
by Algorithm A is

\[ \frac{1}{n+1} \sum_{i=0}^{n} \frac{\int g(x)\, e^{-\lambda \Sigma_i(g)}\, \pi(dg)}{\int e^{-\lambda \Sigma_i(g)}\, \pi(dg)}. \]

Algorithm A has also been used with the classification loss. For this
non-convex loss, it has the same properties as the empirical risk
minimizer on Q ([43, 42]). To give the optimal convergence rate, the
parameter $\lambda$ and the distribution $\pi$ should be appropriately
chosen. When Q is finite, the estimator belongs to the convex hull of
the set Q.
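For a finite reference set, Algorithm A reduces to averaging n + 1 vectors of Gibbs weights. The sketch below illustrates this under assumptions that are not in the paper: the function name `progressive_mixture`, the toy data, the two constant prediction functions and the square loss are all hypothetical.

```python
import math


def progressive_mixture(data, functions, prior, lam, loss):
    """Algorithm A on a finite set: Cesaro mean of the Gibbs posteriors
    pi_{-lam * Sigma_i}, i = 0..n. Returns (predictor, averaged weights)."""
    n = len(data)
    cum = [0.0] * len(functions)                     # Sigma_0 = 0
    posteriors = []
    for i in range(n + 1):
        shift = min(cum)                             # numerical stability only
        w = [p * math.exp(-lam * (c - shift)) for p, c in zip(prior, cum)]
        total = sum(w)
        posteriors.append([wi / total for wi in w])
        if i < n:                                    # move from Sigma_i to Sigma_{i+1}
            cum = [c + loss(data[i], g) for c, g in zip(cum, functions)]
    avg = [sum(post[k] for post in posteriors) / (n + 1)
           for k in range(len(functions))]
    return (lambda x: sum(a * g(x) for a, g in zip(avg, functions))), avg


# Hypothetical example: outputs in [-1, 1], so lam = 1 / (2 B^2) with B = 1.
const0 = lambda x: 0.0
const1 = lambda x: 1.0
square_loss = lambda z, g: (z[1] - g(z[0])) ** 2
predict, avg = progressive_mixture([(0.0, 1.0)] * 5, [const0, const1],
                                   [0.5, 0.5], 0.5, square_loss)
_, avg_empty = progressive_mixture([], [const0, const1], [0.5, 0.5], 0.5, square_loss)
```

Since the averaged weights form a probability vector, the returned predictor is a convex combination of the functions in Q, as stated above.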

From the works of Vovk, Haussler, Kivinen and Warmuth ([56, 37, 57])
and the link between cumulative loss in the online setting and expected
risk in the batch setting (see later Lemma 4.3), an "optimal" algorithm
is:

Algorithm B: Let $\lambda > 0$. For any $i \in \{0, \dots, n\}$, let
$h_i$ be a prediction function such that

\[ \forall z \in \mathcal{Z} \qquad L(z, h_i) \le -\frac{1}{\lambda} \log \mathbb{E}_{g \sim \pi_{-\lambda \Sigma_i}} e^{-\lambda L(z, g)}. \]

If one of the $h_i$ does not exist, the algorithm is said to fail.
Otherwise it predicts according to $\frac{1}{n+1} \sum_{i=0}^{n} h_i$.

In particular, for appropriate $\lambda > 0$, this algorithm does not
fail when the loss function is the square loss (i.e.
$L(z, g) = [y - g(x)]^2$) and when the output space is bounded.
Algorithm B is based on the same Gibbs distribution
$\pi_{-\lambda \Sigma_i}$ as Algorithm A. Besides, in [37, Example
3.13], it is shown that Algorithm A is not in general a particular case
of Algorithm B, and that Algorithm B will not generally produce a
prediction function in the convex hull of Q unlike Algorithm A. In
Sections 4 and 5, we will see how both algorithms are connected to the
SeqRand algorithm presented in the next section.

3 The algorithm and its generalization error bound

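The aim of this section is to build an algorithm with the best possible
minimax convergence rate. The algorithm relies on the following central
condition for which we recall that Q is a subset of the set $\bar{Q}$ of
all prediction functions and that $\mathcal{M}$ and $\mathcal{D}$ are
the sets of all probability distributions on respectively Q and
$\bar{Q}$.

For any $\lambda > 0$, let $\delta_\lambda$ be a real-valued function
defined on $\mathcal{Z} \times Q \times \bar{Q}$ that satisfies the
following inequality, which will be referred to as the variance
inequality:

\[ \forall \rho \in \mathcal{M} \quad \exists\, \hat{\pi}(\rho) \in \mathcal{D} \qquad \sup_{P \in \mathcal{P}} \Big\{ \mathbb{E}_{Z \sim P}\, \mathbb{E}_{g' \sim \hat{\pi}(\rho)} \log \mathbb{E}_{g \sim \rho}\, e^{\lambda [L(Z, g') - L(Z, g) - \delta_\lambda(Z, g, g')]} \Big\} \le 0. \]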

The variance inequality is our probabilistic version of the generic
algorithm condition in the online prediction setting (see [56, proof of
Theorem 1] or more explicitly in [37, p. 11]), in which we added the
variance function $\delta_\lambda$. Our results will be all the sharper
as this variance function is small. To make the variance inequality more
readable, let us say for the moment that

• without any assumption on $\mathcal{P}$, for several usual "strongly"
convex loss functions, we may take $\delta_\lambda \equiv 0$ provided
that $\lambda$ is a small enough constant (see Section 4).

• the variance inequality can be seen as a "small expectation"
inequality. The usual viewpoint is to control the quantity L(Z, g)
by its expectation w.r.t. Z and a variance term. Here, roughly,
L(Z, g) is mainly controlled by L(Z, g') where g' is appropriately
chosen through the choice of $\hat{\pi}(\rho)$, plus the additive term
$\delta_\lambda$. By definition this additive term does not depend on
the particular probability distribution generating the data and leads
to empirical compensation.

• in the examples we will be interested in throughout this work,
$\hat{\pi}(\rho)$ will be either equal to $\rho$ or to a Dirac
distribution on some function, which is not necessarily in Q.

• for any loss function L, any set $\mathcal{P}$ and any $\lambda > 0$,
one may choose
$\delta_\lambda(Z, g, g') = \frac{\lambda}{2} [L(Z, g) - L(Z, g')]^2$
(see Section 6).

Our results concern the sequentially randomized algorithm described
in Figure 1, which for brevity we will call the SeqRand algorithm.

Remark 3.1. When $\delta_\lambda(Z, g, g')$ does not depend on g, we
recover a more standard-style algorithm to the extent that we then have
$\pi_{-\lambda S_i} = \pi_{-\lambda \Sigma_i}$. Precisely our algorithm
becomes the randomized version of Algorithm A. When
$\delta_\lambda(Z, g, g')$ depends on g, the posterior distributions
tend to concentrate on functions having small risk and small variance
term. In Section 6, we will take
$\delta_\lambda(Z, g, g') = \frac{\lambda}{2} [L(Z, g) - L(Z, g')]^2$.
This choice implies a conservative mechanism: roughly, with high
probability, among functions having low cumulative risk $\Sigma_i$,
$\hat{g}_i$ will be chosen close to $\hat{g}_{i-1}$.

For any $i \in \{0, \dots, n\}$, the quantities $S_i$, $\hat{\rho}_i$
and $\hat{g}_i$ depend on the training data only through $Z_1^i$, where
we recall that $Z_1^i$ denotes $(Z_1, \dots, Z_i)$. Besides they are
also random to the extent that they depend on the draws of the functions
$\hat{g}_0, \dots, \hat{g}_{i-1}$.

Input: $\lambda > 0$ and $\pi$ a distribution on the set Q.

1. Define $\hat{\rho}_0 \triangleq \hat{\pi}(\pi)$ in the sense of the
variance inequality and draw a function $\hat{g}_0$ according to this
distribution. Let $S_0(g) = 0$ for any $g \in Q$.

2. For any $i \in \{1, \dots, n\}$, iteratively define

\[ S_i(g) \triangleq S_{i-1}(g) + L(Z_i, g) + \delta_\lambda(Z_i, g, \hat{g}_{i-1}) \quad \text{for any } g \in Q \tag{3.1} \]

and

\[ \hat{\rho}_i \triangleq \hat{\pi}(\pi_{-\lambda S_i}) \quad \text{in the sense of the variance inequality} \]

and draw a function $\hat{g}_i$ according to the distribution
$\hat{\rho}_i$.

3. Predict with a function drawn according to the uniform distribution
on the finite set $\{\hat{g}_0, \dots, \hat{g}_n\}$.

Conditionally to the training set, the distribution of the output
prediction function will be denoted $\hat{\mu}$.

Figure 1: The SeqRand algorithm
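The three steps of the SeqRand algorithm can be sketched as follows for a finite reference set. This is only an illustration under stated assumptions: it uses the particular choice $\hat{\pi}(\rho) = \rho$ with $\delta_\lambda(Z, g, g') = \frac{\lambda}{2}[L(Z, g) - L(Z, g')]^2$ mentioned above, and the function name `seqrand`, the toy data and the absolute loss are hypothetical.

```python
import math
import random


def seqrand(data, functions, prior, lam, loss, rng):
    """SeqRand on a finite set, assuming pi_hat(rho) = rho and
    delta_lam(Z, g, g') = (lam / 2) * (L(Z, g) - L(Z, g'))**2.
    Returns the output function and the indices of g_hat_0, ..., g_hat_n."""
    def draw(S):
        # Sample an index from the Gibbs distribution pi_{-lam * S}.
        shift = min(S)                               # numerical stability only
        w = [p * math.exp(-lam * (s - shift)) for p, s in zip(prior, S)]
        return rng.choices(range(len(functions)), weights=w)[0]

    S = [0.0] * len(functions)                       # step 1: S_0 = 0
    drawn = [draw(S)]                                # index of g_hat_0
    for z in data:                                   # step 2: i = 1, ..., n
        prev = loss(z, functions[drawn[-1]])         # L(Z_i, g_hat_{i-1})
        S = [s + loss(z, g) + 0.5 * lam * (loss(z, g) - prev) ** 2
             for s, g in zip(S, functions)]
        drawn.append(draw(S))
    return functions[rng.choice(drawn)], drawn       # step 3: uniform draw


# Hypothetical example with losses in [0, 1].
rng = random.Random(0)
functions = [lambda x: 0.0, lambda x: 1.0]
abs_loss = lambda z, g: abs(z[1] - g(z[0]))
data = [(None, 1.0)] * 20
g_out, drawn = seqrand(data, functions, [0.5, 0.5], 1.0, abs_loss, rng)
```

The squared term penalizes functions whose loss on $Z_i$ differs from that of the previously drawn function, which is the conservative mechanism described in Remark 3.1.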

The SeqRand algorithm produces a prediction function which has
three causes of randomness: the training data, the way $\hat{g}_i$ is
obtained (step 2) and the uniform draw (step 3). For fixed $Z_1^i$
(i.e. conditional on $Z_1^i$), let $\Omega_i$ denote the joint
distribution of $\hat{g}_0^i = (\hat{g}_0, \dots, \hat{g}_i)$. The
randomizing distribution $\hat{\mu}$ of the output prediction function
by SeqRand is the distribution on $\bar{Q}$ corresponding to the last
two causes of randomness. From the previous definitions, for any
function $h : \bar{Q} \to \mathbb{R}$, we have
$\mathbb{E}_{g \sim \hat{\mu}} h(g) = \mathbb{E}_{\hat{g}_0^n \sim \Omega_n} \frac{1}{n+1} \sum_{i=0}^{n} h(\hat{g}_i)$.
Our main upper bound controls the expected risk
$\mathbb{E}_{Z_1^n} \mathbb{E}_{g \sim \hat{\mu}} R(g)$ of the SeqRand
procedure.

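Theorem 3.1. Let
$\Delta_\lambda(g, g') \triangleq \mathbb{E}_{Z \sim P}\, \delta_\lambda(Z, g, g')$
for $g \in Q$ and $g' \in \bar{Q}$, where we recall that
$\delta_\lambda$ is a function satisfying the variance inequality. The
expected risk of the SeqRand algorithm satisfies

\[ \mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat{\mu}} R(g') \le \min_{\rho \in \mathcal{M}} \Big\{ \mathbb{E}_{g \sim \rho} R(g) + \mathbb{E}_{g \sim \rho}\, \mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat{\mu}} \Delta_\lambda(g, g') + \frac{K(\rho, \pi)}{\lambda (n+1)} \Big\}. \tag{3.2} \]

In particular, when Q is finite and when the loss function L and the
set $\mathcal{P}$ are such that $\delta_\lambda \equiv 0$, by taking
$\pi$ uniform on Q, we get

\[ \mathbb{E}_{Z_1^n} \mathbb{E}_{g \sim \hat{\mu}} R(g) \le \min_{Q} R + \frac{\log |Q|}{\lambda (n+1)}. \tag{3.3} \]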

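Proof. Let $\mathcal{E}$ denote the expected risk of the SeqRand
algorithm:

\[ \mathcal{E} \triangleq \mathbb{E}_{Z_1^n} \mathbb{E}_{g \sim \hat{\mu}} R(g) = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^i} \mathbb{E}_{\hat{g}_0^i \sim \Omega_i} R(\hat{g}_i). \]

We recall that $Z_{n+1}$ is a random variable independent of the
training set $Z_1^n$ and with the same distribution P. Let $S_{n+1}$ be
defined by (3.1) for i = n + 1. To shorten formulae, let
$\hat{\pi}_i \triangleq \pi_{-\lambda S_i}$ so that by definition we
have $\hat{\rho}_i = \hat{\pi}(\hat{\pi}_i)$. The variance inequality
implies that

\[ \mathbb{E}_{g' \sim \hat{\pi}(\rho)} R(g') \le -\frac{1}{\lambda}\, \mathbb{E}_{Z}\, \mathbb{E}_{g' \sim \hat{\pi}(\rho)} \log \mathbb{E}_{g \sim \rho}\, e^{-\lambda [L(Z, g) + \delta_\lambda(Z, g, g')]}. \]

So for any $i \in \{0, \dots, n\}$, for fixed
$\hat{g}_0^{i-1} = (\hat{g}_0, \dots, \hat{g}_{i-1})$ and fixed $Z_1^i$,
we have

\[ \mathbb{E}_{g' \sim \hat{\rho}_i} R(g') \le -\frac{1}{\lambda}\, \mathbb{E}_{Z_{i+1}} \mathbb{E}_{g' \sim \hat{\rho}_i} \log \mathbb{E}_{g \sim \hat{\pi}_i}\, e^{-\lambda [L(Z_{i+1}, g) + \delta_\lambda(Z_{i+1}, g, g')]}. \]

Taking the expectations w.r.t. $(Z_1^i, \hat{g}_0^{i-1})$, we get

\[ \begin{aligned} \mathbb{E}_{Z_1^i} \mathbb{E}_{\hat{g}_0^i} R(\hat{g}_i) &= \mathbb{E}_{Z_1^i} \mathbb{E}_{\hat{g}_0^{i-1}} \mathbb{E}_{g' \sim \hat{\rho}_i} R(g') \\ &\le -\frac{1}{\lambda}\, \mathbb{E}_{Z_1^{i+1}} \mathbb{E}_{\hat{g}_0^i} \log \mathbb{E}_{g \sim \hat{\pi}_i}\, e^{-\lambda [L(Z_{i+1}, g) + \delta_\lambda(Z_{i+1}, g, \hat{g}_i)]}. \end{aligned} \]

Consequently, by the chain rule (i.e. cancellation in the sum of
logarithmic terms; [11]) and by intensive use of Fubini's theorem, we get

\[ \begin{aligned} \mathcal{E} &= \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^i} \mathbb{E}_{\hat{g}_0^i} R(\hat{g}_i) \\ &\le -\frac{1}{\lambda (n+1)} \sum_{i=0}^{n} \mathbb{E}_{Z_1^{i+1}} \mathbb{E}_{\hat{g}_0^i} \log \mathbb{E}_{g \sim \hat{\pi}_i}\, e^{-\lambda [L(Z_{i+1}, g) + \delta_\lambda(Z_{i+1}, g, \hat{g}_i)]} \\ &= -\frac{1}{\lambda (n+1)}\, \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat{g}_0^n} \sum_{i=0}^{n} \log \left( \frac{\mathbb{E}_{g \sim \pi}\, e^{-\lambda S_{i+1}(g)}}{\mathbb{E}_{g \sim \pi}\, e^{-\lambda S_i(g)}} \right) \\ &= -\frac{1}{\lambda (n+1)}\, \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat{g}_0^n} \log \left( \frac{\mathbb{E}_{g \sim \pi}\, e^{-\lambda S_{n+1}(g)}}{\mathbb{E}_{g \sim \pi}\, e^{-\lambda S_0(g)}} \right) \\ &= -\frac{1}{\lambda (n+1)}\, \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat{g}_0^n} \log \mathbb{E}_{g \sim \pi}\, e^{-\lambda S_{n+1}(g)}. \end{aligned} \]

Now from the following lemma, we obtain

\[ \begin{aligned} \mathcal{E} &\le -\frac{1}{\lambda (n+1)} \log \mathbb{E}_{g \sim \pi}\, e^{-\lambda\, \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat{g}_0^n} S_{n+1}(g)} \\ &= -\frac{1}{\lambda (n+1)} \log \mathbb{E}_{g \sim \pi}\, e^{-\lambda \left[ (n+1) R(g) + \mathbb{E}_{Z_1^n} \mathbb{E}_{\hat{g}_0^n} \sum_{i=0}^{n} \Delta_\lambda(g, \hat{g}_i) \right]} \\ &= \min_{\rho \in \mathcal{M}} \Big\{ \mathbb{E}_{g \sim \rho} R(g) + \mathbb{E}_{g \sim \rho}\, \mathbb{E}_{Z_1^n} \mathbb{E}_{\hat{g}_0^n} \frac{\sum_{i=0}^{n} \Delta_\lambda(g, \hat{g}_i)}{n+1} + \frac{K(\rho, \pi)}{\lambda (n+1)} \Big\}. \end{aligned} \]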

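Lemma 3.2. Let W be a real-valued measurable function defined on a
product space $A_1 \times A_2$ and let $\mu_1$ and $\mu_2$ be
probability distributions on respectively $A_1$ and $A_2$ such that
$\mathbb{E}_{a_1 \sim \mu_1} \log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-W(a_1, a_2)} < +\infty$.
We have

\[ -\mathbb{E}_{a_1 \sim \mu_1} \log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-W(a_1, a_2)} \le -\log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-\mathbb{E}_{a_1 \sim \mu_1} W(a_1, a_2)}. \]

Proof. By using twice (2.1) and Fubini's theorem, we have

\[ \begin{aligned} -\mathbb{E}_{a_1 \sim \mu_1} \log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-W(a_1, a_2)} &= \mathbb{E}_{a_1 \sim \mu_1} \inf_{\rho} \big\{ \mathbb{E}_{a_2 \sim \rho} W(a_1, a_2) + K(\rho, \mu_2) \big\} \\ &\le \inf_{\rho}\, \mathbb{E}_{a_1 \sim \mu_1} \big\{ \mathbb{E}_{a_2 \sim \rho} W(a_1, a_2) + K(\rho, \mu_2) \big\} \\ &= -\log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-\mathbb{E}_{a_1 \sim \mu_1} W(a_1, a_2)}. \end{aligned} \]

□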

Inequality (3.3) is a direct consequence of (3.2). □ 
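Lemma 3.2 is easy to check numerically when both spaces are finite. The sketch below evaluates the two sides of the inequality; the matrix W and the two uniform distributions are arbitrary illustrative values.

```python
import math


def lemma_3_2_sides(W, mu1, mu2):
    """W[i][j] = W(a1_i, a2_j). Returns (lhs, rhs) of the inequality of Lemma 3.2."""
    lhs = -sum(
        m1 * math.log(sum(m2 * math.exp(-W[i][j]) for j, m2 in enumerate(mu2)))
        for i, m1 in enumerate(mu1)
    )
    # E_{a1 ~ mu1} W(a1, a2) for each a2.
    mean_W = [sum(m1 * W[i][j] for i, m1 in enumerate(mu1))
              for j in range(len(mu2))]
    rhs = -math.log(sum(m2 * math.exp(-w) for m2, w in zip(mu2, mean_W)))
    return lhs, rhs


# Arbitrary illustrative values.
lhs, rhs = lemma_3_2_sides([[0.0, 2.0], [3.0, 0.5]], [0.5, 0.5], [0.5, 0.5])
# When W does not depend on a1, both sides coincide.
lhs_eq, rhs_eq = lemma_3_2_sides([[1.0, 2.0], [1.0, 2.0]], [0.5, 0.5], [0.5, 0.5])
```

In the proof of Theorem 3.1, the role of $a_1$ is played by the pair $(Z_1^{n+1}, \hat{g}_0^n)$ and the role of $a_2$ by $g \sim \pi$.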

Theorem 3.1 bounds the expected risk of a randomized procedure,
where the expectation is taken w.r.t. both the training set
distribution and the randomizing distribution. From the following lemma,
for convex loss functions, (3.3) implies

\[ \mathbb{E}_{Z_1^n} R(\mathbb{E}_{g \sim \hat{\mu}}\, g) \le \min_{Q} R + \frac{\log |Q|}{\lambda (n+1)}, \tag{3.4} \]

where we recall that $\hat{\mu}$ is the randomizing distribution of the
SeqRand algorithm and $\lambda$ is a parameter whose typical value is
the largest $\lambda > 0$ such that $\delta_\lambda \equiv 0$.

Lemma 3.3. For convex loss functions, the doubly expected risk of
a randomized algorithm is greater than or equal to the expected risk of
the deterministic version of the randomized algorithm, i.e. if
$\hat{\rho}$ denotes the randomizing distribution, we have

\[ \mathbb{E}_{Z_1^n} R(\mathbb{E}_{g \sim \hat{\rho}}\, g) \le \mathbb{E}_{Z_1^n} \mathbb{E}_{g \sim \hat{\rho}} R(g). \]

Proof. The result is a direct consequence of Jensen's inequality. □

In [27], the authors rely on worst-case analysis to recover standard- 
style statistical results such as Vapnik's bounds [54]. Theorem 3.1 can 
be seen as a complement to this pioneering work. Inequality (3.4) is the 
model selection bound that is well-known for least square regression 
and entropy loss, and that has been recently proved for general losses 
in [38]. 

Let us discuss the generalized form of the result. The r.h.s. of (3.2)
is a classical regularized risk, which appears naturally in the
PAC-Bayesian approach (see e.g. [25, 24, 7, 61]). An advantage of
stating the result this way is to be able to deal with uncountably
infinite Q. Even when Q is countable, this formulation has some benefit
to the extent that for any measurable function $h : Q \to \mathbb{R}$,
$\min_{\rho \in \mathcal{M}} \{ \mathbb{E}_{g \sim \rho} h(g) + K(\rho, \pi) \} \le \min_{g \in Q} \{ h(g) + \log \pi^{-1}(g) \}$.

Our generalization error bounds depend on two quantities $\lambda$ and
$\pi$ which are the parameters of our algorithm. Their choice depends on
the precise setting. Nevertheless, when Q is finite and with no
particular structure a priori, a natural choice for $\pi$ is the uniform
distribution on Q.

Once the distribution $\pi$ is fixed, an appropriate choice for the
parameter $\lambda$ is the minimizer of the r.h.s. of (3.2). This
minimizer is unknown to the statistician, and it is an open problem to
adaptively choose $\lambda$ close to it in this general context.
Solutions for specific sequential prediction frameworks are known (see
[10, Section 2] and [30, Lemma 3]). They are based on incremental
updating of $\lambda$. In the appendix, one may find a slight
improvement of the argument used in the aforementioned works, based on
Lemma D.2.

4 Link with sequential prediction 

This section aims at providing examples for which the variance
inequality of Section 3 holds, at stating results coming from the online
learning community in our batch setting (Section 4.1), and at providing
new results for the sequential prediction setting in which no
probabilistic assumption is made on the way the data are generated
(Section 4.2).

4.1 From online to batch



In [56, 37, 57], the loss function is assumed to satisfy: there are
positive numbers $\eta$ and c such that

\[ \forall \rho \in \mathcal{M} \quad \exists\, g_\rho : \mathcal{X} \to \mathcal{Y} \quad \forall (x, y) \in \mathcal{Z} \qquad L[(x, y), g_\rho] \le -\frac{c}{\eta} \log \mathbb{E}_{g \sim \rho}\, e^{-\eta L[(x, y), g]}. \tag{4.1} \]

Remark 4.1. If $g \mapsto e^{-\eta L(Z, g)}$ is concave, then (4.1)
holds for c = 1 (and one may take
$g_\rho = \mathbb{E}_{g \sim \rho}\, g$).

Assumption (4.1) implies that the variance inequality is satisfied
both for $\lambda = \eta$ and
$\delta_\lambda(Z, g, g') = (1 - 1/c) L(Z, g')$ and for
$\lambda = \eta / c$ and $\delta_\lambda(Z, g, g') = (c - 1) L(Z, g)$,
and we may take in both cases $\hat{\pi}(\rho)$ as the Dirac
distribution at $g_\rho$. This leads to the same procedure that is
described in the following straightforward corollary of Theorem 3.1.

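Corollary 4.1. Let $g_{\pi_{-\eta \Sigma_i}}$ be defined in the sense of
(4.1) (for $\rho = \pi_{-\eta \Sigma_i}$). Consider the algorithm which
predicts by drawing a function in
$\{ g_{\pi_{-\eta \Sigma_0}}, \dots, g_{\pi_{-\eta \Sigma_n}} \}$
according to the uniform distribution. Under Assumption (4.1), its
expected risk
$\mathbb{E}_{Z_1^n} \frac{1}{n+1} \sum_{i=0}^{n} R(g_{\pi_{-\eta \Sigma_i}})$
is upper bounded by

\[ c \min_{\rho \in \mathcal{M}} \Big\{ \mathbb{E}_{g \sim \rho} R(g) + \frac{K(\rho, \pi)}{\eta (n+1)} \Big\}. \tag{4.2} \]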

This result is not surprising in view of the following two results. 
The first one comes from worst-case analysis in sequential prediction. 

Theorem 4.2 (Haussler et al. [37], Theorem 3.8). Let Q be countable.
For any $g \in Q$, let $\Sigma_i(g) = \sum_{j=1}^{i} L(Z_j, g)$ (still)
denote the cumulative loss up to time i of the expert which always
predicts according to function g. Under Assumption (4.1), the
cumulative loss on $Z_1^n$ of the strategy in which the prediction at
time i is done according to $g_{\pi_{-\eta \Sigma_{i-1}}}$ in the sense
of (4.1) (for $\rho = \pi_{-\eta \Sigma_{i-1}}$) is bounded by

\[ \inf_{g \in Q} \Big\{ c\, \Sigma_n(g) + \frac{c}{\eta} \log \pi^{-1}(g) \Big\}. \tag{4.3} \]

The second result shows how the previous bound can be transposed 
into our model selection context by the following lemma. 

Lemma 4.3. Let $\mathcal{A}$ be a learning algorithm which produces the
prediction function $\mathcal{A}(Z_1^i)$ at time i + 1, i.e. from the
data $Z_1^i = (Z_1, \dots, Z_i)$. Let $\mathcal{L}$ be the randomized
algorithm which produces a prediction function $\mathcal{L}(Z_1^n)$
drawn according to the uniform distribution on
$\{ \mathcal{A}(\emptyset), \mathcal{A}(Z_1), \dots, \mathcal{A}(Z_1^n) \}$.
The (doubly) expected risk of $\mathcal{L}$ is equal to
$\frac{1}{n+1}$ times the expectation of the cumulative loss of
$\mathcal{A}$ on the sequence $Z_1, \dots, Z_{n+1}$.

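Proof. By Fubini's theorem, we have

\[ \begin{aligned} \mathbb{E}\, R[\mathcal{L}(Z_1^n)] &= \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^i} R[\mathcal{A}(Z_1^i)] \\ &= \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^{i+1}} L[Z_{i+1}, \mathcal{A}(Z_1^i)]. \end{aligned} \]

□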
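The online-to-batch conversion of Lemma 4.3 can be sketched as a thin wrapper around any sequential learner. In the snippet below, `mean_learner`, the toy data and the square loss are hypothetical choices made for illustration only.

```python
import random


def online_to_batch(learner, data, rng):
    """Randomized algorithm of Lemma 4.3: draw uniformly among
    A(empty), A(Z_1), ..., A(Z_1^n)."""
    i = rng.randrange(len(data) + 1)                 # uniform on {0, ..., n}
    return learner(data[:i])


def average_sequential_loss(learner, data, loss):
    """(1 / (n + 1)) * sum_{i=0}^{n} L(Z_{i+1}, A(Z_1^i)) on n + 1 pairs."""
    total = sum(loss(data[i], learner(data[:i])) for i in range(len(data)))
    return total / len(data)


def mean_learner(prefix):
    """Hypothetical learner A: predicts the mean of the outputs seen so far."""
    ys = [y for _, y in prefix]
    mean = sum(ys) / len(ys) if ys else 0.0
    return lambda x: mean


square_loss = lambda z, g: (z[1] - g(z[0])) ** 2
sequence = [(0.0, 1.0)] * 3                          # Z_1, Z_2, Z_3, i.e. n = 2
avg_loss = average_sequential_loss(mean_learner, sequence, square_loss)
g_out = online_to_batch(mean_learner, sequence[:2], random.Random(0))
```

Taking the expectation of `avg_loss` over i.i.d. sequences gives the doubly expected risk of the randomized algorithm, which is the content of the lemma.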



For any $\eta > 0$, let $c(\eta)$ denote the infimum of the c for which
(4.1) holds. Under weak assumptions, Vovk ([57]) proved that the infimum
exists and studied the behavior of $c(\eta)$ and
$a(\eta) = c(\eta) / \eta$, which are key quantities of (4.2) and (4.3).
Under weak assumptions, and in particular in the examples given in
Table 1, the optimal constants in (4.3) are $c(\eta)$ and $a(\eta)$
([57, Theorem 1]) and we have $c(\eta) \ge 1$,
$\eta \mapsto c(\eta)$ nondecreasing and $\eta \mapsto a(\eta)$
nonincreasing. From these last properties, we understand the trade-off
which occurs to choose the optimal $\eta$.





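Learning task | Output space | Loss L(Z, g) | $c(\eta)$
--------------------------------------------------------------------------
Entropy loss [37, Example 4.3] | $\mathcal{Y} = [0; 1]$ | $Y \log \frac{Y}{g(X)} + (1 - Y) \log \frac{1 - Y}{1 - g(X)}$ | $c(\eta) = 1$ if $\eta \le 1$; $c(\eta) = \infty$ if $\eta > 1$
Absolute loss game [37, Section 4.2] | $\mathcal{Y} = [0; 1]$ | $|Y - g(X)|$ | $\frac{\eta}{2 \log[2 / (1 + e^{-\eta})]} = 1 + \eta/4 + o(\eta)$
Square loss [37, Example 4.4] | $\mathcal{Y} = [-B, B]$ | $[Y - g(X)]^2$ | $c(\eta) = 1$ if $\eta \le 1/(2B^2)$; $c(\eta) = +\infty$ if $\eta > 1/(2B^2)$
$L_q$-loss (see Corollary 4.5) | $\mathcal{Y} = [-B, B]$ | $|Y - g(X)|^q$, $q > 1$ | $c(\eta) = 1$ if $\eta \le \frac{q-1}{q B^q} (1 \wedge 2^{2-q})$

Table 1: Value of $c(\eta)$ for different loss functions. Here B denotes
a positive real.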
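To make the trade-off in the choice of $\eta$ concrete, the following sketch evaluates the absolute loss entry of Table 1, $c(\eta) = \eta / (2 \log[2 / (1 + e^{-\eta})])$, together with $a(\eta) = c(\eta)/\eta$. The function names are hypothetical.

```python
import math


def c_absolute(eta):
    """c(eta) for the absolute loss game on [0, 1], as listed in Table 1."""
    return eta / (2.0 * math.log(2.0 / (1.0 + math.exp(-eta))))


def a_absolute(eta):
    """a(eta) = c(eta) / eta, the factor of log(1 / pi(g)) in (4.3)."""
    return c_absolute(eta) / eta


etas = [0.1 * k for k in range(1, 51)]               # grid on (0, 5]
c_values = [c_absolute(e) for e in etas]
a_values = [a_absolute(e) for e in etas]
```

On this grid, `c_values` increases from about 1 while `a_values` decreases, so a larger $\eta$ worsens the factor in front of the cumulative loss of the best expert and improves the factor in front of the complexity term.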



Table 1 specifies (4.2) in different well-known learning tasks. For
instance, for bounded least square regression (i.e. when $|Y| \le B$ for
some B > 0), the generalization error of the algorithm described in
Corollary 4.1 when $\eta = 1/(2B^2)$ is upper bounded by

\[ \min_{\rho \in \mathcal{M}} \Big\{ \mathbb{E}_{g \sim \rho} R(g) + 2B^2\, \frac{K(\rho, \pi)}{n+1} \Big\}. \tag{4.4} \]

The constant appearing in front of the Kullback-Leibler divergence is
much smaller than the ones obtained in the unbounded regression setting
even with Gaussian noise and bounded regression function (see [21, 38]
and [25, p. 87]). The differences between these results partly come
from the absence of boundedness assumptions on the output and from
the weighted average used in the aforementioned works. Indeed the
weighted average prediction function, i.e.
$\mathbb{E}_{g \sim \rho}\, g$, does not satisfy (4.1) for c = 1 and
$\eta = 1/(2B^2)$ as was pointed out in [37, Example 3.13].
Nevertheless, it satisfies (4.1) for c = 1 and $\eta \le 1/(8B^2)$ (by
using the concavity of $x \mapsto e^{-x^2}$ on
$[-1/\sqrt{2}; 1/\sqrt{2}]$ and Remark 4.1), which leads to a similar
but weaker bound (see (4.2)).
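The claim about the weighted average can be illustrated numerically for the square loss with B = 1. The sketch below checks condition (4.1) with c = 1 on a grid of outputs, taking $g_\rho$ to be the mixture mean; the two-point distribution $\rho$ is an arbitrary example chosen for illustration and is not the one of [37, Example 3.13].

```python
import math


def satisfies_4_1(preds, rho, g_rho, eta, ys, c=1.0):
    """Check (y - g_rho)^2 <= -(c / eta) * log E_{g ~ rho} exp(-eta * (y - g)^2)
    for every y in ys, where preds are the values g(x) at a fixed input x."""
    for y in ys:
        mix = sum(r * math.exp(-eta * (y - p) ** 2) for r, p in zip(rho, preds))
        if (y - g_rho) ** 2 > -(c / eta) * math.log(mix) + 1e-12:
            return False
    return True


# Illustrative setting: outputs and predictions in [-1, 1] (B = 1).
preds = [-1.0, 1.0]
rho = [0.1, 0.9]
mean_pred = sum(r * p for r, p in zip(rho, preds))   # weighted average, about 0.8
ys = [-1.0 + 2.0 * k / 200 for k in range(201)]      # grid on [-1, 1]

ok_small = satisfies_4_1(preds, rho, mean_pred, 1.0 / 8.0, ys)   # eta = 1 / (8 B^2)
ok_large = satisfies_4_1(preds, rho, mean_pred, 1.0 / 2.0, ys)   # eta = 1 / (2 B^2)
```

With these values the check passes for $\eta = 1/8$ and fails for $\eta = 1/2$ (the violation occurs at y = -1), in line with the discussion above.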

Case of the $L_q$-losses. To deal with these losses, we need the
following slight generalization of the result given in Appendix A of
[39].



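Theorem 4.4. Let $\mathcal{Y} = [a; b]$. We consider a non-regularized
loss function, i.e. a loss function such that
$L(Z, g) = \ell[Y, g(X)]$ for any $Z = (X, Y) \in \mathcal{Z}$ and some
function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. For any
$y \in \mathcal{Y}$, let $\ell_y$ be the function
$[y' \mapsto \ell(y, y')]$. If for any $y \in \mathcal{Y}$

• $\ell_y$ is continuous on $\mathcal{Y}$

• $\ell_y$ decreases on [a; y], increases on [y; b] and $\ell_y(y) = 0$

• $\ell_y$ is twice differentiable on the open set (a; y) ∪ (y; b),

then (4.1) is satisfied for c = 1 and

\[ \eta \le \inf_{a \le y_1 < y < y_2 \le b} \frac{\ell'_{y_1}(y)\, \ell''_{y_2}(y) - \ell''_{y_1}(y)\, \ell'_{y_2}(y)}{\ell'_{y_1}(y)\, [\ell'_{y_2}(y)]^2 - [\ell'_{y_1}(y)]^2\, \ell'_{y_2}(y)}, \tag{4.5} \]

where the infimum is taken w.r.t. $y_1$, y and $y_2$.

Proof. See Section 10.1. □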

Remark 4.2. This result simplifies the original one to the extent that
$\ell_y$ does not need to be twice differentiable at point y and the
range of values for y in the infimum is $(y_1; y_2)$ instead of (a; b).

Corollary 4.5. For the $L_q$-loss, when $\mathcal{Y} = [-B; B]$ for some
B > 0, condition (4.1) is satisfied for c = 1 and

\[ \eta \le \frac{q-1}{q B^q} \big( 1 \wedge 2^{2-q} \big). \]

Proof. We apply Theorem 4.4. By simple computations, the r.h.s. of
(4.5) is

\[ \inf_{-B \le y_1 < y < y_2 \le B} \frac{(q-1)(y_2 - y_1)}{q\, (y - y_1)(y_2 - y) \big[ (y - y_1)^{q-1} + (y_2 - y)^{q-1} \big]} = \frac{q-1}{q (2B)^q} \inf_{0 < t < 1} \frac{1}{t (1-t) \big[ t^{q-1} + (1-t)^{q-1} \big]}. \]

For $1 < q \le 2$, the infimum is reached for t = 1/2 and (4.5) can be
written as $\eta \le \frac{q-1}{q B^q}$. For $q \ge 2$, since the
previous infimum is larger than
$\inf_{0 < t < 1} \frac{1}{t(1-t)} = 4$, (4.5) is satisfied at least
when $\eta \le \frac{4(q-1)}{q (2B)^q}$. □

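The threshold of Corollary 4.5 and the one-dimensional infimum used in its proof can be evaluated directly. The sketch below uses hypothetical function names and a plain grid search as a numerical sanity check; it is not part of the proof.

```python
def eta_max_lq(q, B):
    """Threshold of Corollary 4.5 for the L_q-loss on [-B, B], with q > 1."""
    return (q - 1.0) / (q * B ** q) * min(1.0, 2.0 ** (2.0 - q))


def grid_infimum(q, steps=10_000):
    """Grid approximation of the infimum over 0 < t < 1 of
    1 / (t * (1 - t) * (t**(q-1) + (1 - t)**(q-1)))."""
    return min(
        1.0 / (t * (1.0 - t) * (t ** (q - 1.0) + (1.0 - t) ** (q - 1.0)))
        for t in (k / steps for k in range(1, steps))
    )


square_loss_threshold = eta_max_lq(2.0, 1.0)         # recovers 1 / (2 B^2) for B = 1
inf_q15 = grid_infimum(1.5)                          # close to 2**1.5, attained at t = 1/2
inf_q3 = grid_infimum(3.0)                           # at least 4, as used for q >= 2
```

For q = 2 the threshold coincides with the square loss entry of Table 1.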
4.2 Sequential prediction 

First note that using Corollary 4.5 and Theorem 4.2, we obtain a new
result concerning sequential prediction for the $L_q$-loss. Nevertheless
this result is not due to our approach but to a refinement of the
argument in [39, Appendix A]. In this section, we will rather
concentrate on giving results for sequential prediction coming from the
arguments underlying Theorem 3.1.

In the online setting, the data points come one by one and there
is no probabilistic assumption on the way they are generated. In this
case, one should modify the definition of the variance function into:
for any $\lambda > 0$, let $\delta_\lambda$ be a real-valued function
defined on $\mathcal{Z} \times Q \times \bar{Q}$ that satisfies the
following online variance inequality

\[ \forall \rho \in \mathcal{M} \quad \exists\, \hat{\pi}(\rho) \in \mathcal{D} \qquad \sup_{Z \in \mathcal{Z}} \Big\{ \mathbb{E}_{g' \sim \hat{\pi}(\rho)} \log \mathbb{E}_{g \sim \rho}\, e^{\lambda [L(Z, g') - L(Z, g) - \delta_\lambda(Z, g, g')]} \Big\} \le 0. \]

The only difference with the variance inequality (defined in Section 3)
is the removal of the expectation with respect to Z. Naturally if
$\delta_\lambda$ satisfies the online variance inequality, then it
satisfies the variance inequality. The online version of the SeqRand
algorithm is described in Figure 2. It satisfies the following theorem
whose proof follows the same line as the one of Theorem 3.1.



Input: $\lambda > 0$ and $\pi$ a distribution on the set Q.

1. Define $\hat{\rho}_0 \triangleq \hat{\pi}(\pi)$ in the sense of the
online variance inequality and draw a function $\hat{g}_0$ according to
this distribution. For data $Z_1$, predict according to $\hat{g}_0$.
Let $S_0(g) = 0$ for any $g \in Q$.

2. For any $i \in \{1, \dots, n-1\}$, define

\[ S_i(g) \triangleq S_{i-1}(g) + L(Z_i, g) + \delta_\lambda(Z_i, g, \hat{g}_{i-1}) \quad \text{for any } g \in Q \]

and

\[ \hat{\rho}_i \triangleq \hat{\pi}(\pi_{-\lambda S_i}) \quad \text{in the sense of the online variance inequality} \]

and draw a function $\hat{g}_i$ according to the distribution
$\hat{\rho}_i$. For data $Z_{i+1}$, predict according to $\hat{g}_i$.

Figure 2: The online SeqRand algorithm



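Theorem 4.6. The cumulative loss of the online SeqRand algorithm
satisfies

$$\sum_{i=1}^n \mathbb{E}_{\hat g_{i-1}} L(Z_i, \hat g_{i-1}) \le \min_{\rho \in \mathcal{M}} \Big\{ \mathbb{E}_{g\sim\rho} \sum_{i=1}^n L(Z_i,g) + \mathbb{E}_{g\sim\rho} \sum_{i=1}^n \mathbb{E}_{\hat g_{i-1}} \delta_\lambda(Z_i,g,\hat g_{i-1}) + \frac{K(\rho,\pi)}{\lambda} \Big\}.$$

In particular, when Q is finite, by taking $\pi$ uniform on Q, we get

$$\sum_{i=1}^n \mathbb{E}_{\hat g_{i-1}} L(Z_i, \hat g_{i-1}) \le \min_{g \in Q} \Big\{ \sum_{i=1}^n L(Z_i,g) + \sum_{i=1}^n \mathbb{E}_{\hat g_{i-1}} \delta_\lambda(Z_i,g,\hat g_{i-1}) + \frac{\log |Q|}{\lambda} \Big\}.$$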

Up to the online variance function $\delta_\lambda$, the online variance inequality
is the generic algorithm condition of [37, p. 11]. So the cases where $\delta_\lambda$ is
equal to zero are already known. New results can be obtained by
using the fact that, for any loss function L and any $\lambda > 0$, the online variance
inequality is satisfied for $\delta_\lambda(Z,g,g') = \frac{\lambda}{2}[L(Z,g) - L(Z,g')]^2$ (proof
in Section 10.3). The associated distribution $\hat\pi(\rho)$ is then just $\rho$. This
leads to the following corollary.

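Corollary 4.7. The cumulative loss of the online SeqRand algorithm
with $\delta_\lambda(Z,g,g') = \frac{\lambda}{2}[L(Z,g) - L(Z,g')]^2$ and $\hat\pi(\rho) = \rho$ for any $\rho \in \mathcal{M}$
satisfies

$$\sum_{i=1}^n \mathbb{E}_{\hat g_{i-1}} L(Z_i,\hat g_{i-1}) \le \min_{\rho\in\mathcal{M}}\Big\{ \mathbb{E}_{g\sim\rho}\sum_{i=1}^n L(Z_i,g) + \frac{\lambda}{2}\,\mathbb{E}_{g\sim\rho}\sum_{i=1}^n \mathbb{E}_{\hat g_{i-1}} [L(Z_i,g)-L(Z_i,\hat g_{i-1})]^2 + \frac{K(\rho,\pi)}{\lambda}\Big\}. \tag{4.6}$$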

Note that the prediction functions $\hat g_i$ appear in both the left-hand
side and the right-hand side of (4.6). For loss functions taking their
values in an interval of range A, we have $[L(Z_i,g) - L(Z_i,\hat g_{i-1})]^2 \le A^2$. So
when Q is finite, by taking $\pi$ uniform on Q and $\lambda = \sqrt{(2\log|Q|)/(nA^2)}$,
we obtain the more explicit cumulative regret bound:

$$\sum_{i=1}^n \mathbb{E}_{\hat g_{i-1}} L(Z_i,\hat g_{i-1}) - \min_{g\in Q} \sum_{i=1}^n L(Z_i,g) \le A\sqrt{2n\log|Q|}. \tag{4.7}$$

This bound is loose by a factor 2 (see [29, Theorem 2.2]). Nevertheless,
an advantage of the online SeqRand algorithm with
$\delta_\lambda(Z,g,g') = \frac{\lambda}{2}[L(Z,g) - L(Z,g')]^2$ is that it will take advantage
of situations in which

$$\sum_{i=1}^n \mathbb{E}_{g\sim\rho}\,\mathbb{E}_{\hat g_{i-1}} [L(Z_i,g) - L(Z_i,\hat g_{i-1})]^2 \ll nA^2,$$

whereas it is not clear that the exponentially weighted average forecaster
does. The proper tuning of the parameter $\lambda$ is a nontrivial
task, whereas for the exponentially weighted average forecaster with
incremental updates, it has recently been proved that one can tune
this parameter without any prior knowledge of the loss sequences [30,
Theorem 6]. The argument given in the appendix (p. 71) can be applied
in order to use incremental updates, but for the online SeqRand
algorithm with $\delta_\lambda(Z,g,g') = \frac{\lambda}{2}[L(Z,g) - L(Z,g')]^2$, we do not know
how to choose the updates in order to recover a result similar to [30,
Theorems 5 and 6].
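The online procedure of Figure 2 can be sketched for a finite reference set as follows. This is only an illustrative sketch under stated assumptions: the prior is uniform, the variance function is the quadratic one of Corollary 4.7 (so that the drawing distribution is the Gibbs distribution itself), and the losses are supplied as a precomputed matrix. The names `online_seqrand`, `tuned_lambda` and `regret_bound` are hypothetical helpers, not part of the original text.

```python
import math
import random


def online_seqrand(losses, lam, seed=0):
    """Run an online SeqRand sketch on a loss matrix.

    losses[t][g] is the loss of candidate g on the (t+1)-th data point.
    Returns the drawn candidate indices and the cumulative loss incurred.
    Assumes a uniform prior and the quadratic variance function.
    """
    rng = random.Random(seed)
    n, k = len(losses), len(losses[0])
    scores = [0.0] * k  # S_0(g) = 0 for every g
    picks, total = [], 0.0
    for t in range(n):
        # Gibbs weights proportional to exp(-lam * S(g)); subtracting the
        # minimum only rescales the weights and avoids underflow.
        low = min(scores)
        weights = [math.exp(-lam * (s - low)) for s in scores]
        g_hat = rng.choices(range(k), weights=weights)[0]
        picks.append(g_hat)
        total += losses[t][g_hat]
        # S_i(g) = S_{i-1}(g) + L(Z_i, g) + (lam/2) [L(Z_i, g) - L(Z_i, g_hat)]^2
        ref = losses[t][g_hat]
        scores = [s + l + 0.5 * lam * (l - ref) ** 2
                  for s, l in zip(scores, losses[t])]
    return picks, total


def tuned_lambda(k, n, loss_range):
    """lam = sqrt(2 log k / (n A^2)), the tuning used for (4.7)."""
    return math.sqrt(2.0 * math.log(k) / (n * loss_range ** 2))


def regret_bound(k, n, loss_range):
    """Right-hand side of (4.7): A * sqrt(2 n log k)."""
    return loss_range * math.sqrt(2.0 * n * math.log(k))
```

Note that (4.7) is a bound in expectation over the randomization, so a single run of `online_seqrand` need not satisfy it.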

5 Model selection aggregation under Juditsky,
Rigollet and Tsybakov assumptions ([38])

The main result of [38] relies on the following assumption on the loss
function L and the set V of probability distributions on $\mathcal{Z}$ in which
we assume that the true distribution lies. There exist $\lambda > 0$ and a
real-valued function $\psi$ defined on $Q \times Q$ such that for any $P \in V$

$$\begin{cases} \mathbb{E}_{Z\sim P}\, e^{\lambda[L(Z,g') - L(Z,g)]} \le \psi(g',g) & \text{for any } g, g' \in Q \\ \psi(g,g) = 1 & \text{for any } g \in Q \\ \text{the function } [g \mapsto \psi(g',g)] \text{ is concave} & \text{for any } g' \in Q \end{cases} \tag{5.1}$$

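Theorem 3.1 gives the following result.

Corollary 5.1. Consider the algorithm which draws uniformly its prediction
function in the set $\{\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_0}}\, g, \dots, \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_n}}\, g\}$. Under
Assumption (5.1), its expected risk $\mathbb{E}_{Z_1^n} \frac{1}{n+1}\sum_{i=0}^n R(\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}}\, g)$ is upper
bounded by

$$\min_{\rho\in\mathcal{M}} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + \frac{K(\rho,\pi)}{\lambda(n+1)} \Big\}. \tag{5.2}$$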

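Proof. We start by proving that the variance inequality holds with
$\delta_\lambda = 0$, and that we may take $\hat\pi(\rho)$ as the Dirac distribution at the
function $\mathbb{E}_{g\sim\rho}\, g$. By using Jensen's inequality and Fubini's theorem,
Assumption (5.1) implies that

$$\begin{aligned} \mathbb{E}_{g'\sim\hat\pi(\rho)}\mathbb{E}_{Z\sim P} \log \mathbb{E}_{g\sim\rho}\, e^{\lambda[L(Z,g') - L(Z,g)]} &= \mathbb{E}_{Z\sim P} \log \mathbb{E}_{g\sim\rho}\, e^{\lambda[L(Z,\mathbb{E}_{g'\sim\rho} g') - L(Z,g)]} \\ &\le \log \mathbb{E}_{g\sim\rho}\mathbb{E}_{Z\sim P}\, e^{\lambda[L(Z,\mathbb{E}_{g'\sim\rho} g') - L(Z,g)]} \\ &\le \log \mathbb{E}_{g\sim\rho}\, \psi(\mathbb{E}_{g'\sim\rho}\, g', g) \\ &\le \log \psi(\mathbb{E}_{g'\sim\rho}\, g', \mathbb{E}_{g\sim\rho}\, g) \\ &= 0, \end{aligned}$$

so that we can apply Theorem 3.1. It remains to note that in this
context the SeqRand algorithm is the one described in the corollary.

□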
In this context, the SeqRand algorithm reduces to the randomized
version of Algorithm A. From Lemma 3.3, for convex loss functions,
(5.2) also holds for the risk of Algorithm A. Corollary 5.1 also shows
that the risk bounds for Algorithm A proved in [38, Theorem 3.2 and
the examples of Section 4.2] hold with the same constants for the
SeqRand algorithm (provided that the expected risk w.r.t. the training
set distribution is replaced by the expected risk w.r.t. both training
set and randomizing distributions).

Regarding Assumption (5.1), note that it does not a priori require
the function L to be convex. Nevertheless, all known relevant
examples deal with "strongly" convex loss functions, and we know that
in general the assumption will not hold for the SVM (or hinge) loss
function and for the absolute loss function. Indeed, without further
assumption, one cannot expect rates better than $\sqrt{(\log|Q|)/n}$ for these loss
functions (see Section 8.4.2).

By taking the appropriate variance function $\delta_\lambda(Z,g,g')$, it is possible
to prove that the results in [38, Theorem 3.1 and Section 4.1] hold
for the SeqRand algorithm (provided that the expected risk w.r.t. the
training set distribution is replaced by the expected risk w.r.t. both
training set and randomizing distributions). The choice of $\delta_\lambda(Z,g,g')$,
which for the sake of brevity we do not specify, is in fact such that the
resulting SeqRand algorithm is again the randomized version of
Algorithm A.

6 Standard-style statistical bounds 

This section proposes new results of a different kind. In the previous
sections, under convexity assumptions, we were able to achieve fast
rates. Here we make no assumption on either the loss function or
the distribution generating the data. Nevertheless, we show that the
SeqRand algorithm applied with $\delta_\lambda(Z,g,g') = \lambda[L(Z,g) - L(Z,g')]^2/2$
satisfies a sharp standard-style statistical bound.

This section contains two parts: the first one provides results in
expectation (as in the preceding sections), whereas the second part
provides deviation inequalities on the risk, which require advances in
the sequential prediction analysis.

6.1 Bounds on the expected risk 
6.1.1 Bernstein's type bound 

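Theorem 6.1. Let $V(g,g') = \mathbb{E}_Z\{[L(Z,g) - L(Z,g')]^2\}$. Consider the
SeqRand algorithm (see p. 8) applied with $\delta_\lambda(Z,g,g') = \lambda[L(Z,g) - L(Z,g')]^2/2$
and $\hat\pi(\rho) = \rho$. Its expected risk $\mathbb{E}_{Z_1^n}\mathbb{E}_{g\sim\hat\mu} R(g)$, where we
recall that $\hat\mu$ denotes the randomizing distribution, satisfies

$$\mathbb{E}_{Z_1^n}\mathbb{E}_{g'\sim\hat\mu} R(g') \le \min_{\rho\in\mathcal{M}} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + \frac{\lambda}{2}\,\mathbb{E}_{g\sim\rho}\mathbb{E}_{Z_1^n}\mathbb{E}_{g'\sim\hat\mu} V(g,g') + \frac{K(\rho,\pi)}{\lambda(n+1)} \Big\}. \tag{6.1}$$

Proof. See Section 10.3. □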

To make (6.1) more explicit and to obtain a generalization error
bound in which the randomizing distribution does not appear in the
r.h.s. of the bound, the following corollary considers a widely used
assumption relating the variance term to the excess risk (see Mammen
and Tsybakov [45, 52], and also Polonik [48]). Precisely, from Theorem
6.1, we obtain

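Corollary 6.2. If there exist $0 \le \gamma \le 1$ and a prediction function
$\tilde g$ (not necessarily in Q) such that $V(g,\tilde g) \le c[R(g) - R(\tilde g)]^\gamma$ for any
$g \in Q$, the expected risk $\mathcal{E} = \mathbb{E}_{Z_1^n}\mathbb{E}_{g\sim\hat\mu} R(g)$ of the SeqRand algorithm
used in Theorem 6.1 satisfies

• When $\gamma = 1$,

$$\mathcal{E} - R(\tilde g) \le \min_{\rho\in\mathcal{M}}\Big\{ \frac{1+c\lambda}{1-c\lambda}\,[\mathbb{E}_{g\sim\rho}R(g) - R(\tilde g)] + \frac{K(\rho,\pi)}{(1-c\lambda)\lambda(n+1)} \Big\}.$$

In particular, for Q finite, $\pi$ the uniform distribution, $\lambda = 1/(2c)$,
when $\tilde g$ belongs to Q, we get $\mathcal{E} \le \min_{g\in Q} R(g) + \frac{4c\log|Q|}{n+1}$.

• When $\gamma < 1$, for any $0 < \beta < 1$ and for $\tilde R(g) = R(g) - R(\tilde g)$,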
£ R(§) < min (E gr ^ p [R(g) + cXR^g)} + } V ( T 



A 

-0 



Proof. See Section 10.4. □ 

To understand the sharpness of Theorem 6.1, we have to compare 
this result with the following one that comes from the traditional (PAC- 
Bayesian) statistical learning approach which relies on supremum of 
empirical processes. In the following theorem, we consider the esti- 
mator minimizing the uniform bound, i.e. the estimator for which we 
have the smallest upper bound on its generalization error. 

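Theorem 6.3. We still use $V(g,g') = \mathbb{E}_Z\{[L(Z,g) - L(Z,g')]^2\}$. The
generalization error of the algorithm which draws its prediction function
according to the Gibbs distribution $\pi_{-\lambda\Sigma_n}$ satisfies

$$\begin{aligned}\mathbb{E}_{Z_1^n}\mathbb{E}_{g'\sim\pi_{-\lambda\Sigma_n}} R(g') \le \min_{\rho\in\mathcal{M}}\Big\{ & \mathbb{E}_{g\sim\rho}R(g) + \frac{K(\rho,\pi)+1}{\lambda n} + \lambda\,\mathbb{E}_{g\sim\rho}\mathbb{E}_{Z_1^n}\mathbb{E}_{g'\sim\pi_{-\lambda\Sigma_n}} V(g,g') \\ & + \lambda\,\frac{1}{n}\sum_{i=1}^n \mathbb{E}_{g\sim\rho}\mathbb{E}_{Z_1^n}\mathbb{E}_{g'\sim\pi_{-\lambda\Sigma_n}} [L(Z_i,g)-L(Z_i,g')]^2\Big\}.\end{aligned} \tag{6.2}$$

Let $\varphi$ be the positive convex increasing function defined as $\varphi(t) = \frac{e^t - 1 - t}{t^2}$
and $\varphi(0) = \frac{1}{2}$ by continuity. When $\sup_{z\in\mathcal{Z}, g\in Q, g'\in Q} |L(z,g') - L(z,g)| \le B$,
we also have

$$\mathbb{E}_{Z_1^n}\mathbb{E}_{g'\sim\pi_{-\lambda\Sigma_n}} R(g') \le \min_{\rho\in\mathcal{M}}\Big\{ \mathbb{E}_{g\sim\rho}R(g) + \lambda\varphi(\lambda B)\,\mathbb{E}_{g\sim\rho}\mathbb{E}_{Z_1^n}\mathbb{E}_{g'\sim\pi_{-\lambda\Sigma_n}} V(g,g') + \frac{K(\rho,\pi)+1}{\lambda n}\Big\}. \tag{6.3}$$

Proof. See Section 10.5. □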
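The function $\varphi(t) = (e^t - 1 - t)/t^2$ entering (6.3) can be evaluated numerically as in the sketch below. The switch to a short Taylor expansion near zero and the threshold `1e-4` are implementation choices made here for numerical stability, not something prescribed by the text.

```python
import math


def phi(t):
    """phi(t) = (exp(t) - 1 - t) / t^2, extended by phi(0) = 1/2."""
    if abs(t) < 1e-4:
        # Taylor expansion 1/2 + t/6 + t^2/24 avoids cancellation near 0.
        return 0.5 + t / 6.0 + t * t / 24.0
    # expm1(t) = exp(t) - 1 computed without loss of precision.
    return (math.expm1(t) - t) / (t * t)
```

Since $\varphi$ is increasing, the factor $\lambda\varphi(\lambda B)$ in (6.3) exceeds the factor $\lambda/2$ of (6.1) as soon as $\lambda B > 0$.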

As in Theorem 6.1, there is a variance term in which the randomizing
distribution is involved. As in Corollary 6.2, one can convert (6.3)
into a proper generalization error bound, that is, a nontrivial bound
$\mathbb{E}_{Z_1^n}\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_n}} R(g) \le \mathcal{B}(n,\pi,\lambda)$ where the training data do not appear
in $\mathcal{B}(n,\pi,\lambda)$.

By comparing (6.3) and (6.1), we see that the classical approach
requires the quantity $\sup_{g\in Q, g'\in Q} |L(Z,g') - L(Z,g)|$ to be uniformly
bounded, and the unpleasing function $\varphi$ appears. In fact, using technical
small expectations theorems (see e.g. [4, Lemma 7.1]), exponential
moment conditions on the above quantity would be sufficient.

The symmetrization trick used to prove Theorem 6.1 is performed
in the prediction function space. We do not call on the second virtual
training set currently used in statistical learning theory (see [54]).
Nevertheless, both symmetrization tricks lead to the same nice property:
we need no boundedness assumption on the loss functions. In our
setting, symmetrization on training data leads to an unwanted expectation
and to a constant four times larger (see the two variance terms
of (6.2) and the discussion in [5, Section 8.3.3]).

In particular, deducing from Theorem 6.3 a corollary similar to
Corollary 6.2 is only possible through (6.3) and provided that we have a
boundedness assumption on $\sup_{z\in\mathcal{Z}, g\in Q, g'\in Q} |L(z,g') - L(z,g)|$. Indeed,
one cannot use (6.2) because of the last variance term in (6.2) (since
$\Sigma_n$ depends on $Z_i$).

Our approach nevertheless has the following limitation: the proof of
Corollary 6.2 does not use a chaining argument. As a consequence,
in the particular case when the model has polynomial entropies (see
e.g. [45]) and when the assumption in Corollary 6.2 holds for $\gamma < 1$
(and not for $\gamma = 1$), Corollary 6.2 does not give the minimax optimal
convergence rate. Combining the better variance control presented
here with the chaining argument is an open problem.

6.1.2 Hoeffding's type bound 

Contrary to generalization error bounds coming from Bernstein's
inequality, (6.1) does not require any boundedness assumption. For
bounded losses, without any variance assumption (i.e. roughly when
the assumption used in Corollary 6.2 does not hold for $\gamma > 0$), tighter
results are obtained by using Hoeffding's inequality, that is: for any
random variable W satisfying $a \le W \le b$ and for any $\lambda > 0$,

$$\mathbb{E}\, e^{\lambda(W - \mathbb{E}W)} \le e^{\lambda^2(b-a)^2/8}.$$

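Theorem 6.4. Assume that for any $z \in \mathcal{Z}$ and $g \in Q$, we have
$a \le L(z,g) \le b$ for some reals a, b. Consider the SeqRand algorithm (see
p. 8) applied with $\delta_\lambda(Z,g,g') = \lambda(b-a)^2/8$ and $\hat\pi(\rho) = \rho$. Its expected
risk $\mathbb{E}_{Z_1^n}\mathbb{E}_{g\sim\hat\mu} R(g)$, where we recall that $\hat\mu$ denotes the randomizing
distribution, satisfies

$$\mathbb{E}_{Z_1^n}\mathbb{E}_{g\sim\hat\mu} R(g) \le \min_{\rho\in\mathcal{M}} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + \frac{\lambda(b-a)^2}{8} + \frac{K(\rho,\pi)}{\lambda(n+1)} \Big\}. \tag{6.4}$$

In particular, when Q is finite, by taking $\pi$ uniform on Q and
$\lambda = \sqrt{\frac{8\log|Q|}{(b-a)^2(n+1)}}$, we get

$$\mathbb{E}_{Z_1^n}\mathbb{E}_{g\sim\hat\mu} R(g) - \min_{g\in Q} R(g) \le (b-a)\sqrt{\frac{\log|Q|}{2(n+1)}}. \tag{6.5}$$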
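Proof. From Hoeffding's inequality, we have

$$\mathbb{E}_{g'\sim\hat\pi(\rho)} \log \mathbb{E}_{g\sim\rho}\, e^{\lambda[L(Z,g') - L(Z,g)]} = \log \mathbb{E}_{g\sim\rho}\, e^{\lambda[\mathbb{E}_{g'\sim\hat\pi(\rho)} L(Z,g') - L(Z,g)]} \le \frac{\lambda^2(b-a)^2}{8},$$

hence the variance inequality holds for $\delta_\lambda = \lambda(b-a)^2/8$ and $\hat\pi(\rho) = \rho$.
The result directly follows from Theorem 3.1. □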
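The passage from (6.4) to (6.5) is a one-dimensional optimization in $\lambda$, which the following sketch checks numerically. It assumes a uniform prior on k functions and $\rho$ a Dirac mass, so that $K(\rho,\pi) = \log k$; the function names are hypothetical.

```python
import math


def hoeffding_rhs(lam, k, n, a, b):
    """lam (b - a)^2 / 8 + log(k) / (lam (n + 1)): the excess-risk part of
    (6.4) for a uniform prior on k functions and rho a Dirac mass."""
    return lam * (b - a) ** 2 / 8.0 + math.log(k) / (lam * (n + 1))


def hoeffding_lambda(k, n, a, b):
    """Minimizer of hoeffding_rhs over lam > 0."""
    return math.sqrt(8.0 * math.log(k) / ((b - a) ** 2 * (n + 1)))


def hoeffding_rate(k, n, a, b):
    """Right-hand side of (6.5): (b - a) sqrt(log(k) / (2 (n + 1)))."""
    return (b - a) * math.sqrt(math.log(k) / (2.0 * (n + 1)))
```

At the minimizer the two terms of `hoeffding_rhs` are equal, which is the usual balance between the variance term and the complexity term.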

The standard point of view (see Appendix B) applies Hoeffding's
inequality to the random variable $W = L(Z,g') - L(Z,g)$ for g and g'
fixed and Z drawn according to the probability generating the data.
The previous theorem uses it on the random variable
$W = L(Z,g') - \mathbb{E}_{g\sim\rho} L(Z,g)$ for fixed Z and fixed probability distribution $\rho$ but for g'
drawn according to $\rho$. Here the gain is a multiplicative factor equal to
2 (see Appendix B).

6.2 Deviation inequalities 

For the comparison between Theorem 6.1 and Theorem 6.3 to be fair,
one should add that (6.3) and (6.2) come from deviation inequalities
that, to the author's knowledge, are not exactly obtainable with the
arguments developed here. Precisely, consider the following adaptation
of Lemma 5 of [60].

Lemma 6.5. Let $\mathcal{A}$ be a learning algorithm which produces the prediction
function $\mathcal{A}(Z_1^i)$ at time i + 1, i.e. from the data $Z_1^i = (Z_1, \dots, Z_i)$.
Let $\mathcal{L}$ be the randomized algorithm which produces a prediction function
$\mathcal{L}(Z_1^n)$ drawn according to the uniform distribution on $\{\mathcal{A}(\emptyset), \mathcal{A}(Z_1), \dots, \mathcal{A}(Z_1^n)\}$.
Assume that $\sup_{z,g,g'} |L(z,g) - L(z,g')| \le B$ for some B > 0. Conditionally
on $Z_1, \dots, Z_{n+1}$, the expectation of the risk of $\mathcal{L}$ w.r.t. the
uniform draw is $\frac{1}{n+1}\sum_{i=0}^n R[\mathcal{A}(Z_1^i)]$ and satisfies: for any $\eta > 0$ and
$\epsilon > 0$, for any reference prediction function $\tilde g$, with probability at least
$1 - \epsilon$ w.r.t. the distribution of $Z_1, \dots, Z_{n+1}$,

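$$\frac{1}{n+1}\sum_{i=0}^n R[\mathcal{A}(Z_1^i)] - R(\tilde g) \le \frac{1}{n+1}\sum_{i=0}^n \{L[Z_{i+1},\mathcal{A}(Z_1^i)] - L(Z_{i+1},\tilde g)\} + \eta\,\varphi(\eta B)\,\frac{1}{n+1}\sum_{i=0}^n V[\mathcal{A}(Z_1^i),\tilde g] + \frac{\log(\epsilon^{-1})}{\eta(n+1)}, \tag{6.6}$$

where we still use $V(g,g') = \mathbb{E}_Z\{[L(Z,g) - L(Z,g')]^2\}$ for any prediction
functions g and g' and $\varphi(t) = \frac{e^t - 1 - t}{t^2}$ for any t > 0.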

Proof. See Section 10.6. □ 

We see that two variance terms appear. The first one comes from
the worst-case analysis and is hidden in $\sum_{i=0}^n \{L[Z_{i+1},\mathcal{A}(Z_1^i)] - L(Z_{i+1},\tilde g)\}$,
and the second one comes from the concentration result (Lemma 10.1).
The presence of this last variance term annihilates the benefits of our
approach, in which we were manipulating variance terms much smaller
than the traditional Bernstein's variance term.

To illustrate this point, consider for instance least square regression
with bounded outputs: from Theorem 4.2 and Table 1, the hidden
variance term is null. In some situations, the second variance
term $\frac{1}{n+1}\sum_{i=0}^n V[\mathcal{A}(Z_1^i),\tilde g]$ may behave like a positive constant: for
instance, this occurs when Q contains two very different functions having
the optimal risk $\min_{g\in Q} R(g)$. By optimizing $\eta$, this will lead to a
deviation inequality of order $n^{-1/2}$ even though, from (4.4), the procedure
has an $n^{-1}$ convergence rate in expectation. In [8, Theorem 3], in a
rather general learning setting, this deviation inequality of order $n^{-1/2}$
is proved to be optimal.

To conclude, for deviation inequalities, we cannot expect to do 
better than the standard-style approach since at some point we use 
a Bernstein's type bound w.r.t. the distribution generating the data. 
Besides procedures based on worst-case analysis seem to suffer higher 
fluctuations of the risk than necessary (see [8, discussion of Theorem 
3]). 

Remark 6.1. Lemma 6.5 should be compared with Lemma 4.3. The 
latter deals with results in expectation while the former concerns devi- 
ation inequalities. Note that Lemma 6.5 requires the loss function to 
be bounded and makes a variance term appear. 
7 Application to $L_q$-regression for unbounded
outputs

In this section, we consider the $L_q$-loss: $L(Z,g) = |Y - g(X)|^q$. As
a warm-up exercise, we tackle the absolute loss setting (i.e. q = 1).
The following corollary holds without any assumption on the output
(except, naturally, that $\mathbb{E}_Z|Y| < +\infty$ to ensure finite risk).

Corollary 7.1. Let q = 1. Assume that $\sup_{g\in Q} \mathbb{E}_Z\, g(X)^2 \le b^2$ for
some b > 0. There exists an estimator $\hat g$ such that

$$\mathbb{E}R(\hat g) - \min_{g\in Q} R(g) \le 2b\sqrt{\frac{2\log|Q|}{n+1}}. \tag{7.1}$$

Proof. Using $\mathbb{E}_Z\{[|Y-g(X)| - |Y-g'(X)|]^2\} \le 4b^2$ and Theorem 6.1,
the algorithm considered in Theorem 6.1 satisfies
$\mathbb{E}R(\hat g) - \min_{g\in Q}R(g) \le 2\lambda b^2 + \frac{\log|Q|}{\lambda(n+1)}$,
which gives the desired result by taking $\lambda = \sqrt{\frac{\log|Q|}{2b^2(n+1)}}$.

□

Now we deal with the strongly convex loss functions (i.e. q > 1).
Using Theorem 3.1 jointly with the symmetrization idea developed in
the previous section allows us to obtain new convergence rates in heavy
noise situations, i.e. when the output is not constrained to have a
bounded exponential moment. We start with the following theorem
concerning general loss functions.

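Theorem 7.2. Assume that $\sup_{g\in Q, x\in\mathcal{X}} |g(x)| \le b$ for some b > 0,
and that the output space is $\mathcal{Y} = \mathbb{R}$. Let $B \ge b$. Consider a loss
function L which can be written as $L[(x,y),g] = \ell[y,g(x)]$, where the
function $\ell : \mathbb{R}\times\mathbb{R}\to\mathbb{R}$ satisfies: there exists $\lambda_0 > 0$ such that for any
$y \in [-B;B]$, the function $y' \mapsto e^{-\lambda_0\ell(y,y')}$ is concave on $[-b;b]$. Let

$$\Delta(y) = \sup_{|\alpha|\le b,\,|\beta|\le b} [\ell(y,\alpha) - \ell(y,\beta)].$$

For $\lambda \in (0;\lambda_0]$, consider the algorithm that draws uniformly its prediction
function in the set $\{\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_0}}\, g, \dots, \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_n}}\, g\}$, and consider
the deterministic version of this randomized algorithm. The expected
risks of these algorithms satisfy

$$\begin{aligned} \mathbb{E}_{Z_1^n} R\Big(\frac{1}{n+1}\sum_{i=0}^n \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}}\, g\Big) &\le \mathbb{E}_{Z_1^n} \frac{1}{n+1}\sum_{i=0}^n R(\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}}\, g) \\ &\le \min_{\rho\in\mathcal{M}} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + \frac{K(\rho,\pi)}{\lambda(n+1)} \Big\} \\ &\quad + \mathbb{E}\Big\{ \frac{\lambda\Delta^2(Y)}{2}\mathbf{1}_{\lambda\Delta(Y)<1;\,|Y|>B} + \Big[\Delta(Y) - \frac{1}{2\lambda}\Big]\mathbf{1}_{\lambda\Delta(Y)\ge 1;\,|Y|>B} \Big\}. \end{aligned} \tag{7.2}$$

Proof. See Section 10.7. □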

Remark 7.1. For $y \in [-B;B]$, concavity of $y' \mapsto e^{-\lambda_0\ell(y,y')}$ on $[-b;b]$
for $\lambda_0 > 0$ implies convexity of $y' \mapsto \ell(y,y')$ on $[-b;b]$.

In particular, for least square regression, Theorem 7.2 can be sim- 
plified into: 

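Theorem 7.3. Assume that $\sup_{g\in Q, x\in\mathcal{X}} |g(x)| \le b$ for some b > 0.
For any $0 < \lambda \le 1/(8b^2)$, the expected risk of the algorithm that draws
uniformly its prediction function among $\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_0}}\, g, \dots, \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_n}}\, g$ is
upper bounded by

$$\min_{\rho\in\mathcal{M}} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + 8\lambda b^2\, \mathbb{E}\big(Y^2 \mathbf{1}_{|Y| \ge (8\lambda)^{-1/2}}\big) + \frac{K(\rho,\pi)}{\lambda(n+1)} \Big\}. \tag{7.3}$$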



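Proof. For any $B \ge b$, for any $y \in [-B;B]$, straightforward computations
show that $y' \mapsto e^{-\lambda_0(y-y')^2}$ is concave on $[-b;b]$ for $\lambda_0 = \frac{1}{2(B+b)^2}$,
so that we can apply Theorem 7.2. We have $\Delta(y) = 4b|y|$ for any $|y| \ge b$,
so that by optimizing the parameter B, we obtain that the expected
risk of the algorithm is upper bounded by

$$\begin{aligned} & \min_{\rho\in\mathcal{M}} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + \frac{K(\rho,\pi)}{\lambda(n+1)} \Big\} + \mathbb{E}\Big\{ \Big(4b|Y| - \frac{1}{2\lambda}\Big)\mathbf{1}_{|Y| \ge (4b\lambda)^{-1}} \Big\} + \mathbb{E}\Big\{ 8\lambda b^2 Y^2 \mathbf{1}_{(2\lambda)^{-1/2} - b < |Y| < (4b\lambda)^{-1}} \Big\} \\ & \quad \le \min_{\rho\in\mathcal{M}} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + \frac{K(\rho,\pi)}{\lambda(n+1)} \Big\} + \mathbb{E}\Big\{ 8\lambda b^2 Y^2 \mathbf{1}_{|Y| \ge (4b\lambda)^{-1}} \Big\} + \mathbb{E}\Big\{ 8\lambda b^2 Y^2 \mathbf{1}_{(2\lambda)^{-1/2} - b < |Y| < (4b\lambda)^{-1}} \Big\}, \end{aligned}$$

which gives the desired result. □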

Theorem 7.3 improves [21, Theorem 1]. 
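Corollary 7.4. Under the assumptions

$$\begin{cases}\sup_{g\in Q,\,x\in\mathcal{X}}|g(x)|\le b & \text{for some } b>0\\ \mathbb{E}|Y|^s\le A & \text{for some } s\ge 2 \text{ and } A>0\\ Q \text{ finite}\end{cases}$$

for $\lambda = C_1\big(\frac{\log|Q|}{n}\big)^{2/(s+2)}$ where $C_1 > 0$ and $\pi$ the uniform distribution
on Q, the expected risk of the algorithm that draws uniformly its prediction
function among $\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_0}}\, g, \dots, \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_n}}\, g$ is upper bounded
by

$$\min_{g\in Q}R(g) + C\Big(\frac{\log|Q|}{n}\Big)^{s/(s+2)} \tag{7.4}$$

for a quantity C which depends only on $C_1$, b, A and s.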

Juditsky, Rigollet and Tsybakov proved that Corollary 7.4 can also 
be obtained through a simple adaptation of their original analysis (see 
[38, Section 4.1]). 
Proof. The moment assumption on Y implies

$$a^{s-q}\,\mathbb{E}|Y|^q \mathbf{1}_{|Y|\ge a} \le A \quad\text{for any } 0\le q\le s \text{ and } a\ge 0. \tag{7.5}$$

As a consequence, the second term in (7.3) is bounded by $8\lambda b^2 A(2\lambda)^{(s-2)/2}$,
so that (7.3) is upper bounded by $\min_{g\in Q}R(g) + A\,2^{2+s/2}b^2\lambda^{s/2} + \frac{\log|Q|}{\lambda(n+1)}$,
which gives the desired result. □

In particular, with the minimal assumption $\mathbb{E}Y^2 \le A$ (i.e. s = 2),
the convergence rate is of order $n^{-1/2}$, and at the other extreme, when s
goes to infinity, we recover the $n^{-1}$ rate we have under exponential
moment condition on the output.

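Using Theorem 7.2, we can generalize Corollary 7.4 to $L_q$-regression
and obtain the following result.

Corollary 7.5. Let q > 1. Assume that

$$\begin{cases}\sup_{g\in Q,\,x\in\mathcal{X}}|g(x)|\le b & \text{for some } b>0\\ \mathbb{E}|Y|^s\le A & \text{for some } s\ge q \text{ and } A>0\\ Q \text{ finite}\end{cases}$$

Let $\pi$ be the uniform distribution on Q, $C_1 > 0$ and

$$\lambda = \begin{cases} C_1\big(\frac{\log|Q|}{n}\big)^{(q-1)/s} & \text{when } q\le s<2q-2 \\ C_1\big(\frac{\log|Q|}{n}\big)^{q/(s+2)} & \text{when } s\ge 2q-2 \end{cases}$$

The expected risk of the algorithm which draws uniformly its prediction
function among $\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_0}}\, g, \dots, \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_n}}\, g$ is upper bounded by

$$\begin{cases} \min_{g\in Q} R(g) + C\big(\frac{\log|Q|}{n}\big)^{1-\frac{q-1}{s}} & \text{when } q\le s\le 2q-2 \\ \min_{g\in Q} R(g) + C\big(\frac{\log|Q|}{n}\big)^{1-\frac{q}{s+2}} & \text{when } s\ge 2q-2 \end{cases}$$

for a quantity C which depends only on $C_1$, b, A, q and s.

Proof. See Section 10.8. □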

Remark 7.2. For q > 2, slow convergence rates (that is, $n^{-\gamma}$ with $\gamma < 1/2$)
appear when the moment assumption is weak: $\mathbb{E}|Y|^s \le A$ for
some A > 0 and $q \le s < 2q-2$. Convergence rates faster than the
standard nonparametric rate $n^{-1/2}$ are achieved for $s > 2q-2$. Fast
convergence rates systematically occur when $1 < q \le 2$, since for these
values of q we have $s \ge q \ge 2q-2$. Surprisingly, for q = 1, the picture
is completely different (see Section 8.4.2 for discussion and minimax
optimality of the results of this section).
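The two regimes discussed in Remark 7.2 can be tabulated with a small helper. The sketch below assumes that the excess risk in Corollary 7.5 scales as $(\log|Q|/n)^{\gamma}$ with $\gamma = 1 - (q-1)/s$ when $q \le s \le 2q-2$ and $\gamma = 1 - q/(s+2)$ when $s \ge 2q-2$; these exponents are our reading of the corollary, and `rate_exponent` is a hypothetical name.

```python
def rate_exponent(q, s):
    """Exponent gamma such that the excess risk bound scales as
    (log|Q| / n) ** gamma, assuming q > 1 and a moment of order s >= q."""
    if q <= 1 or s < q:
        raise ValueError("requires q > 1 and s >= q")
    if s <= 2 * q - 2:
        # Weak moment assumption: exponent at most 1/2.
        return 1.0 - (q - 1.0) / s
    # Stronger moment assumption: exponent above 1/2, tending to 1.
    return 1.0 - q / (s + 2.0)
```

For instance, least square regression (q = 2) with only a second moment (s = 2) gives the exponent 1/2, and the exponent approaches 1 as s grows, in line with the discussion following Corollary 7.4.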

Remark 7.3. Corollary 7.5 assumes that the prediction functions in Q 
are uniformly bounded. It is an open problem to have the same kind 
of results under weaker assumptions such as a finite moment condition 
similar to the one used in Corollary 7.1. 
8 Lower bounds



The simplest way to assess the quality of an algorithm and of its ex- 
pected risk upper bound is to prove a risk lower bound saying that no 
algorithm has better convergence rate. This section provides this kind 
of assertions. The lower bounds developed here have the same spirit as 
the ones in [20, 3, 16], [34, Chap. 15] and [6, Section 5] to the extent 
that it relies on the following ideas: 

• the supremum of a quantity Q(P) when the distribution P be- 
longs to some set V is larger than the supremum over a well 
chosen finite subset of V, and consequently is larger than the 
mean of Q{P) when the distribution P is drawn uniformly in the 
finite subset. 

• when the chosen subset is a hypercube of 2 m distributions (see 
Definition 8.1), the design of a lower bound over the 2 m distri- 
butions reduces to the design of a lower bound over two distribu- 
tions. 

• when a data sequence Z\ , . . . , Z n has similar likelihoods according 
to two different probability distributions, then no estimator will 
be accurate for both distributions: the maximum over the two 
distribution of the risk of any estimator trained on this sequence 
will be all the larger as the Bayes-optimal prediction associated 
with the two distributions are 'far away'. 

We refer the reader to [17] and [51, Chap. 2] for lower bounds not 
particularly based on finding the appropriate hypercube. Our analysis 
focuses on hypercubes since in several settings they afford to obtain 
lower bounds with both the right convergence rate and close to optimal 
constants. Our contribution in this section is 

• to provide results for general non-regularized loss functions (we
recall that non-regularized loss functions are loss functions which
can be written as $L[(x,y),g] = \ell[y,g(x)]$ for some function
$\ell : \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$),
• to improve the upper bound on the variational distance appearing 
in Assouad's argument, 

• to generalize the argument to asymmetrical hypercubes which, to 
our knowledge, is the only way to find the lower bound matching 
the upper bound of Corollary 7.5 for q < s < 2q — 2, 

• to express the lower bounds in terms of similarity measures be- 
tween two distributions characterizing the hypercube. 

• to obtain lower bounds matching the upper bounds obtained in 
the previous sections. 
Remark 8.1. In [37], the optimality of the constant in front of
$(\log|Q|)/n$ has been proved by considering the situation when both $|Q|$
and n go to infinity. Note that this worst-case analysis constant is
not necessarily the same as our batch setting constant. This section
shows that the batch setting constant is not "far" from the worst-case
analysis constant.

Besides, Lemma 4.3, which can be used to convert any worst-case
analysis upper bound into a risk upper bound in our batch setting,
also means that any lower bound for our batch setting leads to a lower
bound in the sequential prediction setting (the converse is not true).
Indeed, the cumulative loss on the worst sequence of data is bigger than
the average cumulative loss when the data are taken i.i.d. from some
probability distribution. As a consequence, the bounds developed in
this section partially solve the open problem introduced in [37, Section
3.4], which consists in developing tight non-asymptotic lower bounds. For
least square loss and entropy loss, our bounds are off by a multiplicative
factor smaller than 4 (see Remarks 8.6 [p. 42] and 8.7 [p. 44]).

This section is organized as follows. Section 8.1 defines the quanti- 
ties that characterize hypercubes of probability distributions and de- 
tails the links between them. Section 8.2 defines a similarity measure 
between probability distributions coming from f-divergences (see [31])
and gives their main properties. We give our main lower bounds in 
Section 8.3. These bounds are illustrated in Section 8.4. 

8.1 Hypercube of probability distributions 

Let $m \in \mathbb{N}^*$. Consider a family of $2^m$ probability distributions on $\mathcal{Z}$

$$\{P_{\bar\sigma} : \bar\sigma \triangleq (\sigma_1,\dots,\sigma_m) \in \{-;+\}^m\}$$

having the same first marginal, denoted $\mu$:

$$P_{\bar\sigma}(dX) = P_{(+,\dots,+)}(dX) \triangleq \mu(dX) \quad\text{for any } \bar\sigma\in\{-;+\}^m,$$

and such that there exist

• a partition $\mathcal{X}_0,\dots,\mathcal{X}_m$ of $\mathcal{X}$,

• functions $h_1$ and $h_2$ defined on $\mathcal{X} - \mathcal{X}_0$ taking their values in $\mathcal{Y}$,

• functions $p_+$ and $p_-$ defined on $\mathcal{X} - \mathcal{X}_0$ taking their values in [0;1],

for which for any $j\in\{1,\dots,m\}$, for any $x\in\mathcal{X}_j$, we have

$$P_{\bar\sigma}(Y = h_1(x)\,|\,X=x) = p_{\sigma_j}(x) = 1 - P_{\bar\sigma}(Y=h_2(x)\,|\,X=x), \tag{8.1}$$

and for any $x\in\mathcal{X}_0$, the distribution of Y knowing X = x is independent
of $\bar\sigma$ (i.e. the $2^m$ conditional distributions are identical).
In particular, (8.1) means that for any $x \in \mathcal{X} - \mathcal{X}_0$, the conditional
probability of the output knowing the input x is concentrated on two
values and that, under the distribution $P_{\bar\sigma}$, the disproportion between
the probabilities of these two values is all the larger as $p_{\sigma_j}(x)$ is far
from 1/2, for j the integer such that $x \in \mathcal{X}_j$.

Remark 8.2. Equality (8.1) indirectly implies that for any x, $h_1(x) \ne h_2(x)$.
This is not at all restrictive, since points for which we would
have liked $h_1(x) = h_2(x)$ can be put in the "garbage" set $\mathcal{X}_0$.

[Figure 3 plot: vertical axis $P[Y = h_1(X)\,|\,X = x]$ with levels $p_+ = \frac{1+\xi}{2}$, 1/2 and $p_- = \frac{1-\xi}{2}$; horizontal axis x over the cells $\mathcal{X}_0, \mathcal{X}_1, \dots, \mathcal{X}_8$.]

Figure 3: Representation of a probability distribution of the hypercube.
Here the hypercube is a constant and symmetrical one (see Definition
8.2) with m = 8 and the probability distribution is characterized by
$\bar\sigma = (+,-,+,-,-,+,+,-)$.



Definition 8.1. The family of $2^m$ probability distributions will be
referred to as a hypercube of distributions if and only if for any
$j \in \{1,\dots,m\}$,

• the probability $\mu(\mathcal{X}_j) = \mu(X \in \mathcal{X}_j)$ is independent of j, i.e.
$\mu(\mathcal{X}_1) = \cdots = \mu(\mathcal{X}_m)$,

• the law of $(p_+(X), p_-(X), h_1(X), h_2(X))$ when X is drawn according
to the conditional distribution $\mu(\cdot|\mathcal{X}_j) = \mu(\cdot|X \in \mathcal{X}_j)$ is
independent of j, i.e. the m conditional distributions are identical.

Remark 8.3. The typical situation in which we encounter hypercubes
is when $\mathcal{X} \subset \mathbb{R}^d$ for some $d \ge 1$ and when we have translation
invariance, to the extent that there exist $t_2, \dots, t_m$ in $\mathbb{R}^d$ such that for
any $j \in \{2,\dots,m\}$, $\mathcal{X}_j = \mathcal{X}_1 + t_j$ and for any $x \in \mathcal{X}_1$,

$$(p_+(x+t_j), p_-(x+t_j), h_1(x+t_j), h_2(x+t_j)) = (p_+(x), p_-(x), h_1(x), h_2(x)).$$

A special hypercube is illustrated in Figure 3.

For any $p \in [0;1]$, $y_1 \in \mathcal{Y}$, $y_2 \in \mathcal{Y}$ and $y \in \mathcal{Y}$, consider

$$\varphi_{p,y_1,y_2}(y) \triangleq p\,\ell(y_1,y) + (1-p)\,\ell(y_2,y). \tag{8.2}$$

When $y_1 \ne y_2$, this is the risk of the prediction function identically
equal to y when the distribution generating the data satisfies
$P[Y = y_1] = p = 1 - P[Y = y_2]$. The case $y_1 = y_2$ corresponds to
$P[Y = y_1 = y_2] = 1$ and will not be of interest to us (since we will use this function
for $y_1 = h_1(x) \ne h_2(x) = y_2$).

Through this distribution, the quantity

$$\phi_{y_1,y_2}(p) \triangleq \inf_{y\in\mathcal{Y}} \varphi_{p,y_1,y_2}(y) \tag{8.3}$$

can be viewed as the risk of the best constant prediction function.
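As an illustration of (8.2) and (8.3), consider the least square loss $\ell(y,y') = (y-y')^2$ on the real line. In that case the infimum in (8.3) is attained at the mean $p\,y_1 + (1-p)\,y_2$ and equals $p(1-p)(y_1-y_2)^2$. The sketch below encodes both quantities; the function names are hypothetical and the square loss is a choice made here for the example.

```python
def phi_p(p, y1, y2, y):
    """Risk of the constant prediction y for the square loss when
    P[Y = y1] = p = 1 - P[Y = y2], i.e. the function in (8.2)."""
    return p * (y1 - y) ** 2 + (1.0 - p) * (y2 - y) ** 2


def best_constant_risk(p, y1, y2):
    """Closed form of (8.3) for the square loss: the infimum is attained at
    the mean p*y1 + (1-p)*y2 and equals p (1-p) (y1 - y2)^2."""
    return p * (1.0 - p) * (y1 - y2) ** 2
```

The closed form is concave in p, as expected for an infimum of functions that are affine in p.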
Remark 8.4. In the binary classification setting, when $\mathcal{Y} = \{-1;+1\}$,
from the Bayes rule, the function $x \mapsto$ a minimizer of $\varphi_{P(Y=1|X=x),+1,-1}$
is the best prediction function, to the extent that it minimizes the risk
R. Section 8.4.1 provides other typical examples of loss functions and
[14] gives an exhaustive study of their links.

For any $q_+$ and $q_-$ in [0;1], introduce

$$\psi_{q_+,q_-,y_1,y_2}(\alpha) \triangleq \phi_{y_1,y_2}[\alpha q_+ + (1-\alpha)q_-] - \alpha\,\phi_{y_1,y_2}(q_+) - (1-\alpha)\,\phi_{y_1,y_2}(q_-). \tag{8.4}$$

Lemma 8.1. 1. For any $y_1\in\mathcal{Y}$, $y_2\in\mathcal{Y}$, $q_+\in[0;1]$ and $q_-\in[0;1]$,
the functions $\phi_{y_1,y_2}$ and $\psi_{q_+,q_-,y_1,y_2}$ are concave, and consequently
admit one-sided derivatives everywhere. The function
$\psi_{q_+,q_-,y_1,y_2}$ is non-negative.

2. Define the function $K_\alpha$ as $K_\alpha(t) = [(1-\alpha)t]\wedge[\alpha(1-t)]$. Let
$q_- = p_-\wedge p_+$ and $q_+ = p_-\vee p_+$. Assume that the function $\phi_{y_1,y_2}$
is twice differentiable by parts on $[q_-;q_+]$, to the extent that there
exist $q_- = \beta_0 < \beta_1 < \cdots < \beta_d < \beta_{d+1} = q_+$ such that for any
$\ell\in\{0,\dots,d\}$, $\phi_{y_1,y_2}$ is twice differentiable on $]\beta_\ell;\beta_{\ell+1}[$. For any
$\ell\in\{1,\dots,d\}$, let $\Delta_\ell$ denote the difference between the right-sided
and left-sided derivatives of $\phi_{y_1,y_2}$ at point $\beta_\ell$. We have

$$\psi_{p_+,p_-,y_1,y_2}(\alpha) = -(p_+-p_-)^2\int_0^1 K_\alpha(t)\,\phi''_{y_1,y_2}[tp_++(1-t)p_-]\,dt - |p_+-p_-|\sum_{\ell=1}^d K_\alpha\Big(\frac{\beta_\ell-p_-}{p_+-p_-}\Big)\Delta_\ell.$$

In particular, we have

$$\psi_{p_+,p_-,y_1,y_2}(1/2) = \frac{(p_+-p_-)^2}{2}\int_0^1[t\wedge(1-t)]\,\big|\phi''_{y_1,y_2}[tp_++(1-t)p_-]\big|\,dt + \frac{1}{2}\sum_{\ell=1}^d\big(|p_+-\beta_\ell|\wedge|\beta_\ell-p_-|\big)|\Delta_\ell|. \tag{8.5}$$



Proof. 1. The function $\phi_{y_1,y_2}$ is concave since it is the infimum of
concave (affine) functions. As a direct consequence, the function
$\psi_{p_+,p_-,y_1,y_2}$ is concave and non-negative.

2. It suffices to apply the following lemma, which is proved in Section
10.9, to the function $f : t \mapsto \phi_{y_1,y_2}[tp_+ + (1-t)p_-]$, which has
critical points $t_\ell$ defined by $t_\ell p_+ + (1-t_\ell)p_- = \beta_\ell$.

Lemma 8.2. Let $f : [0;1] \to \mathbb{R}$ be a function twice differentiable
by parts, to the extent that there exist $0 = t_0 < t_1 < \cdots <
t_d < t_{d+1} = 1$ such that for any $\ell \in \{0,\dots,d\}$, f is twice
differentiable on $]t_\ell;t_{\ell+1}[$. Assume that f is continuous and admits
left-sided derivatives $f'_l(t_\ell)$ and right-sided derivatives $f'_r(t_\ell)$ at the
critical points $t_\ell$. For any $\alpha \in [0;1]$, we have

$$f(\alpha) - \alpha f(1) - (1-\alpha)f(0) = -\int_0^1 K_\alpha(t)f''(t)\,dt - \sum_{\ell=1}^d K_\alpha(t_\ell)\,[f'_r(t_\ell) - f'_l(t_\ell)].$$

□
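For a function that is twice differentiable on the whole of [0;1] (no critical points), the identity of Lemma 8.2 reduces to $f(\alpha) - \alpha f(1) - (1-\alpha)f(0) = -\int_0^1 K_\alpha(t)f''(t)\,dt$ with $K_\alpha(t) = [(1-\alpha)t]\wedge[\alpha(1-t)]$. The sketch below checks this numerically with a midpoint rule; the helper names and the number of integration steps are choices made here for illustration.

```python
def kernel(alpha, t):
    """K_alpha(t) = min((1 - alpha) t, alpha (1 - t))."""
    return min((1.0 - alpha) * t, alpha * (1.0 - t))


def interpolation_gap(f, alpha):
    """Left-hand side f(alpha) - alpha f(1) - (1 - alpha) f(0)."""
    return f(alpha) - alpha * f(1.0) - (1.0 - alpha) * f(0.0)


def kernel_integral(f2, alpha, steps=20000):
    """Midpoint-rule approximation of -int_0^1 K_alpha(t) f''(t) dt,
    where f2 is the second derivative of f."""
    h = 1.0 / steps
    return -h * sum(kernel(alpha, (j + 0.5) * h) * f2((j + 0.5) * h)
                    for j in range(steps))
```

For example, with $f(t) = t^3$ and $f''(t) = 6t$, both sides equal $\alpha^3 - \alpha$.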

Definition 8.2. Let $\{P_{\bar\sigma} : \bar\sigma = (\sigma_1, \dots, \sigma_m) \in \{-;+\}^m\}$ be a
hypercube of distributions.

1. The positive integer $m$ is called the dimension of the hypercube.

2. The probability $w = \mu(\mathcal{X}_1) = \cdots = \mu(\mathcal{X}_m)$ is called the edge
probability.

3. The characteristic function of the hypercube is the function
$\tilde\psi : \mathbb{R}_+ \to \mathbb{R}_+$ defined as: for any $u \in \mathbb{R}_+$,

$$\tilde\psi(u) = \frac{m}{2}(u+1)\, \mathbb{E}_{\mu(dX)}\Big\{ 1_{X \in \mathcal{X}_1}\, \psi_{p_+(X),p_-(X),h_1(X),h_2(X)}\Big(\frac{u}{u+1}\Big) \Big\}$$

$$\phantom{\tilde\psi(u)} = \frac{mw}{2}(u+1)\, \mathbb{E}_{\mu(\cdot|\mathcal{X}_1)}\, \psi_{p_+,p_-,h_1,h_2}\Big(\frac{u}{u+1}\Big) \quad \text{in short.}$$

4. The edge discrepancies of type I of the hypercube are

$$d_I = \frac{\tilde\psi(1)}{mw} = \mathbb{E}_{\mu(\cdot|\mathcal{X}_1)}\, \psi_{p_+,p_-,h_1,h_2}(1/2), \qquad (8.6)$$

$$d_I' = \mathbb{E}_{\mu(\cdot|\mathcal{X}_1)} (p_+ - p_-)^2.$$

5. The edge discrepancy of type II of the hypercube is defined as

$$d_{II} = \mathbb{E}_{\mu(\cdot|\mathcal{X}_1)} \Big[\sqrt{p_+(1-p_-)} - \sqrt{(1-p_+)p_-}\Big]^2. \qquad (8.7)$$

6. A probability distribution $P_0$ on $\mathcal{Z}$ satisfying $P_0(dX) = \mu(dX)$
and, for any $x \in \mathcal{X} - \mathcal{X}_0$, $P_0[Y = h_1(x)|X = x] = 1/2 = P_0[Y = h_2(x)|X = x]$
will be referred to as a base of the hypercube.

7. Let $P_0$ be a base of the hypercube. Consider distributions $P_{[\sigma]}$,
$\sigma \in \{-,+\}$, admitting the following density w.r.t. $P_0$:

$$\frac{dP_{[\sigma]}}{dP_0}(x,y) = \begin{cases} 2p_\sigma(x) & \text{when } x \in \mathcal{X}_1 \text{ and } y = h_1(x) \\ 2[1 - p_\sigma(x)] & \text{when } x \in \mathcal{X}_1 \text{ and } y = h_2(x) \\ 1 & \text{otherwise} \end{cases}$$

The distributions $P_{[-]}$ and $P_{[+]}$ will be referred to as the
representatives of the hypercube.

8. When the functions $p_+$ and $p_-$ are constant on $\mathcal{X}_1$, the hypercube
will be said constant.

9. When the functions $p_+$ and $p_-$ satisfy $p_+ = 1 - p_-$ on $\mathcal{X} - \mathcal{X}_0$,
the hypercube will be said symmetrical. In this case, the function
$2p_+ - 1$ will be denoted $\xi$ so that

$$p_+ = \frac{1+\xi}{2}, \qquad p_- = \frac{1-\xi}{2}. \qquad (8.8)$$

Otherwise it will be said asymmetrical.

The edge discrepancies are non-negative quantities that become
smaller as $p_-$ and $p_+$ become closer. Let us introduce the following
assumption.

Differentiability assumption. For any $x \in \mathcal{X}_1$, the function $\phi_{h_1(x),h_2(x)}$
is twice differentiable and satisfies, for any $t \in [p_-(x) \wedge p_+(x); p_-(x) \vee p_+(x)]$,

$$|\phi''_{h_1(x),h_2(x)}(t)| \ge \zeta \qquad (8.9)$$

for some $\zeta > 0$.

When $\mathcal{Y} \subset \mathbb{R}$, the differentiability assumption is typically fulfilled
when for any $y_1 \ne y_2$, the functions $y \mapsto \ell(y_1,y)$ and $y \mapsto \ell(y_2,y)$
admit second derivatives lower bounded by a positive constant and
when these functions are minimum for respectively $y = y_1$ and $y = y_2$.
This is the case for least square loss and entropy loss, but it is not the
case for hinge loss, absolute loss or classification loss. The following
result gives the main properties of the characteristic function $\tilde\psi$ and
useful lower bounds of it.

Lemma 8.3. The characteristic function of the hypercube is a concave
nondecreasing function and satisfies

• $\tilde\psi(0) = 0$

• $\tilde\psi(u) \ge (u \wedge 1)\,\tilde\psi(1)$

• Under the differentiability assumption (see (8.9)), we have

where we recall that $d_I' = \mathbb{E}_{\mu(\cdot|\mathcal{X}_1)}(p_+ - p_-)^2$. In particular, we
have

$$d_I \ge \frac{\zeta}{8}\, d_I'. \qquad (8.10)$$

Proof. From Lemma 8.1, the function $\psi_{p_+,p_-,y_1,y_2}$ is non-negative and concave.
Therefore the characteristic function is also concave and non-negative
on $\mathbb{R}_+$. Consequently, it is nondecreasing. The remaining assertions of
the lemma are then straightforward. □

To underline the link between the discrepancies of types I and II,
one may consider (8.10) jointly with the following result.

Lemma 8.4. When a hypercube is constant and symmetrical, i.e.
when on $\mathcal{X}_1$, $p_+ = \frac{1+\xi}{2}$ and $p_- = \frac{1-\xi}{2}$ for $\xi$ constant, we have

$$d_I' = d_{II} = \xi^2.$$

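Finally, since the design of constant and symmetrical hypercubes
is the key of numerous lower bounds, we use the following:

Definition 8.3. A $(\tilde m, \tilde w, \tilde d_{II})$-hypercube is a constant and symmetrical
$\tilde m$-dimensional hypercube with edge probability $\tilde w$ and edge discrepancy
of type II equal to $\tilde d_{II}$, and for which $p_+ > 1/2$ and $h_1$ and
$h_2$ are constant functions.

For these hypercubes, we have $m = \tilde m$, $w = \tilde w$, $d_{II} = \tilde d_{II}$ and

$$\xi = \sqrt{d_{II}}, \qquad p_- \equiv \frac{1-\sqrt{d_{II}}}{2}, \qquad p_+ \equiv \frac{1+\sqrt{d_{II}}}{2},$$

and from (8.5), when the function $\phi_{h_1,h_2}$ is twice differentiable on
$]p_-; p_+[$,

$$d_I = \frac{d_{II}}{2} \int_0^1 [t \wedge (1-t)]\, \Big|\phi''_{h_1,h_2}\Big(\frac{1-\sqrt{d_{II}}}{2} + \sqrt{d_{II}}\, t\Big)\Big|\, dt. \qquad (8.11)$$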
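
As a sanity check of (8.11), the following sketch evaluates the integral numerically for the least square loss, for which $\phi_{h_1,h_2}(p) = p(1-p)(h_2-h_1)^2$ and hence $\phi''_{h_1,h_2} \equiv -2(h_2-h_1)^2$, and compares it with the closed form $d_I = d_{II}(h_2-h_1)^2/4$ implied by (8.30). It assumes that (8.11) reads $d_I = \frac{d_{II}}{2}\int_0^1 [t \wedge (1-t)]\,|\phi''_{h_1,h_2}(p_- + \sqrt{d_{II}}\,t)|\,dt$; the helper name `d_I_from_8_11` and the numerical values of $h_1$, $h_2$ and $d_{II}$ are illustrative choices, not taken from the text.

```python
import math

def d_I_from_8_11(d_II, phi_second, steps=20000):
    """Midpoint-rule value of (d_II/2) * int_0^1 min(t, 1-t) |phi''(p_- + sqrt(d_II) t)| dt."""
    xi = math.sqrt(d_II)            # xi = p_+ - p_- for a constant symmetrical hypercube
    p_minus = (1.0 - xi) / 2.0
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += min(t, 1.0 - t) * abs(phi_second(p_minus + xi * t))
    return 0.5 * d_II * total * h

# Least square loss: phi(p) = p (1 - p) (h2 - h1)^2, so phi'' is constant.
h1, h2, d_II = -1.0, 3.0, 0.09      # illustrative values (assumptions)
numeric = d_I_from_8_11(d_II, lambda p: -2.0 * (h2 - h1) ** 2)
closed_form = d_II * (h2 - h1) ** 2 / 4.0
print(numeric, closed_form)         # the two values should agree
```

For a loss whose $\phi''$ is not constant, only the `phi_second` argument needs to change.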



8.2 f-similarity

Let us introduce a similarity measure between probability distributions.
When a probability distribution $P$ is absolutely continuous w.r.t.
another probability distribution $Q$, i.e. $P \ll Q$, $\frac{dP}{dQ}$ denotes the density
of $P$ w.r.t. $Q$.

Definition 8.4. Let $f : \mathbb{R}_+ \to \mathbb{R}_+$ be a concave function. The
f-similarity between two probability distributions is defined as

$$S_f(P,Q) = \begin{cases} \int f\big(\frac{dP}{dQ}\big)\, dQ & \text{if } P \ll Q \\ f(0) & \text{otherwise} \end{cases} \qquad (8.12)$$

Equivalently, if $p$ and $q$ denote the densities of $P$ and $Q$ w.r.t. the
probability distribution $(P + Q)/2$, one can define the f-similarity as

$$S_f(P,Q) = \int_{q>0} f\Big(\frac{p}{q}\Big)\, dQ.$$

It is called f-similarity in reference to f-divergence (see [31]) to
which it is closely related. Precisely, introduce the function $\tilde f = f(1) - f$.
$\tilde f$ is convex and satisfies $\tilde f(1) = 0$. Thus it is associated with an
f-divergence, which is defined as

$$D_{\tilde f}(P,Q) = \begin{cases} \int \tilde f\big(\frac{dP}{dQ}\big)\, dQ & \text{if } P \ll Q \\ \tilde f(0) & \text{otherwise} \end{cases} \qquad (8.13)$$

Then we have $S_f(P,Q) = f(1) - D_{\tilde f}(P,Q)$.

Here we use f-similarities since they are the quantities that naturally
appear when developing our lower bounds. As the f-divergence,
the f-similarity is in general asymmetric in $P$ and $Q$. Nevertheless
for a concave function $f : \mathbb{R}_+ \to \mathbb{R}_+$, one may define a concave function
$f^* : \mathbb{R}_+ \to \mathbb{R}_+$ as $f^*(u) = u f(1/u)$ and $f^*(0) = \lim_{u \to 0} u f(1/u)$,
and we have (see [31] for the equivalent result for f-divergence): when
$P \ll Q$ and $Q \ll P$,

$$S_f(P,Q) = S_{f^*}(Q,P). \qquad (8.14)$$
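
The identities above are easy to check on a finite space. The following sketch is only an illustration: it assumes two strictly positive distributions on three points (so that $P \ll Q$ and $Q \ll P$) and picks the concave function $f(u) = \log(1+u)$; the distributions and the helper names are hypothetical. It verifies $S_f(P,Q) = S_{f^*}(Q,P)$, the relation $S_f(P,Q) = f(1) - D_{\tilde f}(P,Q)$, and the fact that $S_f$ itself is not symmetric.

```python
import math

def f_similarity(f, P, Q):
    """S_f(P, Q) = sum_z f(P(z)/Q(z)) Q(z) for strictly positive finite distributions."""
    return sum(f(p / q) * q for p, q in zip(P, Q))

def f(u):            # concave and non-negative on [0, +inf)
    return math.log1p(u)

def f_star(u):       # f*(u) = u f(1/u)
    return u * f(1.0 / u)

def f_tilde(u):      # convex, vanishes at u = 1
    return f(1.0) - f(u)

P = [0.5, 0.3, 0.2]  # illustrative distributions (assumptions)
Q = [0.2, 0.2, 0.6]

s_pq = f_similarity(f, P, Q)
s_qp_star = f_similarity(f_star, Q, P)
divergence = f_similarity(f_tilde, P, Q)   # same sum with f replaced by f_tilde
s_qp = f_similarity(f, Q, P)
print(s_pq, s_qp_star, f(1.0) - divergence, s_qp)
```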

We will use the following properties of f-similarities.

Lemma 8.5. Let $P$ and $Q$ be two probability distributions on a measurable
space $(\mathcal{E}, \mathcal{B})$ such that $P \ll Q$.

1. Let $f$ and $g$ be non-negative concave functions defined on $\mathbb{R}_+$
and let $a$ and $b$ be non-negative real numbers. For any probability
distributions $P$ and $Q$, we have $S_{af+bg}(P,Q) = aS_f(P,Q) + bS_g(P,Q)$.
Besides if $f \le g$, then $S_f(P,Q) \le S_g(P,Q)$.

2. Let $A \in \mathcal{B}$ such that $\frac{dP}{dQ} = 1$ on $A$. Let $A^c = \mathcal{E} - A$. Let
$f : \mathbb{R}_+ \to \mathbb{R}_+$ be a concave function. Let $P'$ and $Q'$ be probability
distributions on $\mathcal{E}$ such that

• $P' \ll Q'$

• $\frac{dP'}{dQ'} = 1$ on $A$.

• $P' = P$ and $Q' = Q$ on $A^c$, i.e. for any $B \in \mathcal{B}$, $P'(B \cap A^c) = P(B \cap A^c)$
and $Q'(B \cap A^c) = Q(B \cap A^c)$.

We have

$$S_f(P',Q') = S_f(P,Q).$$

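3. Let $\mathcal{E}'$ be a measurable space and $\mu$ be a $\sigma$-finite positive measure
on $\mathcal{E}'$. Let $\{f_v\}_{v \in \mathcal{E}'}$ be a family of non-negative concave functions
defined on $\mathbb{R}_+$ such that $(u,v) \mapsto f_v\big(\frac{dP}{dQ}(u)\big)$ is measurable. We
have

$$\int_{\mathcal{E}'} S_{f_v}(P,Q)\, \mu(dv) = S_{\int_{\mathcal{E}'} f_v\, \mu(dv)}(P,Q). \qquad (8.15)$$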

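Proof. 1. It directly follows from the definition of f-similarities.

2. We have

$$S_f(P',Q') = \int_A f\Big(\frac{dP'}{dQ'}\Big)\, dQ' + \int_{A^c} f\Big(\frac{dP'}{dQ'}\Big)\, dQ'$$

$$\phantom{S_f(P',Q')} = f(1)\,Q(A) + \int_{A^c} f\Big(\frac{dP}{dQ}\Big)\, dQ$$

$$\phantom{S_f(P',Q')} = S_f(P,Q).$$

3. This Fubini-type result follows from the definition of the integral
of non-negative functions on a product space.

□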

8.3 Generalized Assouad's lemma 

We recall that the n-fold product of a distribution $P$ is denoted $P^{\otimes n}$.
We start this section with a general lower bound for hypercubes of
distributions (as defined in Section 8.1). This lower bound is expressed in
terms of a similarity (as defined in Section 8.2) between n-fold products
of representatives of the hypercube.

Theorem 8.6. Let $\mathcal{P}$ be a set of probability distributions containing a
hypercube of distributions of characteristic function $\tilde\psi$ and representatives
$P_{[-]}$ and $P_{[+]}$. For any training set size $n \in \mathbb{N}^*$ and any estimator
$\hat g$, we have

$$\sup_{P \in \mathcal{P}} \{\mathbb{E}R(\hat g) - \min_g R(g)\} \ge S_{\tilde\psi}\big(P_{[+]}^{\otimes n}, P_{[-]}^{\otimes n}\big) \qquad (8.16)$$

where the minimum is taken over the space of all prediction functions
and $\mathbb{E}R(\hat g)$ denotes the expected risk of the estimator $\hat g$ trained on a
sample of size n: $\mathbb{E}R(\hat g) = \mathbb{E}_{Z_1^n \sim P^{\otimes n}} R(\hat g_{Z_1^n}) = \mathbb{E}_{Z_1^n \sim P^{\otimes n}} \mathbb{E}_{(X,Y) \sim P}\, \ell[Y, \hat g_{Z_1^n}(X)]$.

Proof. See Section 10.10. □

This theorem provides a lower bound holding for any estimator and
expressed in terms of the hypercube structure. To obtain a tight lower
bound associated with a particular learning task, it then suffices to find
the hypercube in $\mathcal{P}$ for which the r.h.s. of (8.16) is the largest possible.
By providing lower bounds of $S_{\tilde\psi}(P_{[+]}^{\otimes n}, P_{[-]}^{\otimes n})$ that are more explicit
w.r.t. the hypercube parameters, we obtain the following results, which
are in a more ready-to-use form than Theorem 8.6.

Theorem 8.7. Let $\mathcal{P}$ be a set of probability distributions containing
a hypercube of distributions characterized by its dimension $m$, its edge
probability $w$ and its edge discrepancies $d_I$ and $d_{II}$ (see Definition 8.2).
For any estimator $\hat g$ and training set size $n \in \mathbb{N}^*$, the following
assertions hold.

1. We have

$$\sup_{P \in \mathcal{P}} \{\mathbb{E}R(\hat g) - \min_g R(g)\} \ge m w d_I \Big(1 - \sqrt{1 - [1 - d_{II}]^{nw}}\Big) \ge m w d_I \big(1 - \sqrt{n w d_{II}}\big). \qquad (8.17)$$

2. When the hypercube is constant and symmetrical (see Definition 
8.2) , we also have 

sup{Ei?(.g) - min R(g)} > mwdi{p(\N\ > yftSfc) - d 1 ^) 

(8.18) 

for N a centered gaussian random variable with variance 1 . 

3. When the hypercube satisfies $p_+ \equiv 1 \equiv 1 - p_-$, we also have

$$\sup_{P \in \mathcal{P}} \{\mathbb{E}R(\hat g) - \min_g R(g)\} \ge m w d_I (1 - w)^n. \qquad (8.19)$$

4. When the hypercube is constant and symmetrical and when the
differentiability assumption (see (8.9)) holds, we also have

snp\ER{g) - min R{g)\ > 
Pev 9 

x{i+i[i-(i-vr^H n -i[i+(^a-i)- 

(8.20) 

Proof. See Section 10.11. □ 

The lower bounds (8.17), (8.18) and (8.20) are of the same nature.
(8.17) is the general lower bound having the simplest form. For constant
and symmetrical hypercubes, it can be refined into (8.18) and,
when the differentiability assumption holds, into (8.20). These refinements
mainly concern constants as we will see in Section 8.4.3. Finally,
(8.19) is less general but provides results with tight constants when a
convergence rate of order $n^{-1}$ has to be proven (see Remarks 8.6 [p. 42]
and 8.7 [p. 44]).

To better understand the link between (8.17), (8.18) and (8.20),
the following corollary considers an asymptotic setting in which n goes
to infinity and the parameters of the hypercube vary with n (which
is the typical situation even for finite sample lower bounds).

Corollary 8.8. Let $a > 0$ and $N$ be a centered gaussian random variable
with variance 1. Under the assumptions of item 4 of the previous
theorem, we have

$$d_I \ge \frac{\zeta}{8}\, d_{II} \qquad (8.21)$$

and (8.17), (8.18) and (8.20) respectively lead to

$$\liminf_{nw \to +\infty,\ nwd_{II} \to a} \frac{1}{m w d_I} \sup_{P \in \mathcal{P}} \{\mathbb{E}R(\hat g) - \min_g R(g)\} \ge 1 - \sqrt{1 - e^{-a}}, \qquad (8.22)$$

$$\liminf_{nw \to +\infty,\ nwd_{II} \to a} \frac{1}{m w d_I} \sup_{P \in \mathcal{P}} \{\mathbb{E}R(\hat g) - \min_g R(g)\} \ge P\big(|N| > \sqrt{a}\big) \qquad (8.23)$$

and

$$\liminf_{nw \to +\infty,\ nwd_{II} \to a} \frac{8}{m w \zeta d_{II}} \sup_{P \in \mathcal{P}} \{\mathbb{E}R(\hat g) - \min_g R(g)\} \ge 1 + \frac{e^{-a/2}}{2} - \frac{e^{3a/2}}{2}. \qquad (8.24)$$

Proof. It follows from Theorem 8.7 and $(1+x)^{1/x} \to e$ when $x \to 0$. □

Inequality (8.21) leads to (slightly) weakened versions of (8.22) and 
(8.23) that can be directly compared with (8.24) (see Figure 4). A 
numerical comparison of these bounds is given in Section 8.4.3. 

Remark 8.5. The previous lower bounds consider deterministic estimators
(or algorithms), i.e. functions from the training set space $\cup_{n \ge 0} \mathcal{Z}^n$
to the prediction function space $\mathcal{G}$. They still hold for randomized
estimators, i.e. functions from the training set space to the set $\mathcal{D}$ of
probability distributions on $\mathcal{G}$.



8.4 Examples 

Theorem 8.7 motivates the following simple strategy to obtain a lower
bound for a given set $\mathcal{P}$ of probability distributions and a reference
set $\mathcal{G}$ of prediction functions: it consists in looking for the hypercube
contained in the set $\mathcal{P}$ and for which

• the lower bound is maximized,

• for any distribution of the hypercube, $\mathcal{G}$ contains a best prediction
function, i.e. $\min_g R(g) = \min_{g \in \mathcal{G}} R(g)$.

In general, the order of the bound is given by the quantity $m w d_I$ (or
$m w \zeta d_{II}$ in the case of (8.20)) and the quantities $w$ and $d_{II}$ are taken
such that $n w d_{II}$ is of order 1.
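
To make the last remark concrete, the sketch below maximizes the weakened right-hand side of (8.17), $m w d_I (1 - \sqrt{n w d_{II}})$, over $d_{II}$ for a constant symmetrical hypercube with $w = 1/m$ in the classification setting, where $d_I = \sqrt{d_{II}}/2$. The values of $n$ and $m$ and the function name `weakened_bound` are illustrative assumptions. Setting the derivative to zero gives $d_{II} = m/(4n)$, so that $n w d_{II} = 1/4$ and the bound equals $\frac{1}{8}\sqrt{m/n}$; the grid search only confirms this numerically.

```python
import math

def weakened_bound(m, w, d_II, n):
    """m*w*d_I*(1 - sqrt(n*w*d_II)), the weakened form of (8.17), with d_I = sqrt(d_II)/2."""
    d_I = math.sqrt(d_II) / 2.0
    return m * w * d_I * (1.0 - math.sqrt(n * w * d_II))

n, m = 1000, 20                                   # hypothetical sample size and dimension
w = 1.0 / m
grid = [k / 100000.0 for k in range(1, 100001)]   # candidate values of d_II in (0, 1]
best_d = max(grid, key=lambda d: weakened_bound(m, w, d, n))
best_value = weakened_bound(m, w, best_d, n)

print(best_d, m / (4.0 * n))                  # maximizer vs. m/(4n)
print(n * w * best_d)                         # n*w*d_II at the maximizer: of order 1
print(best_value, math.sqrt(m / n) / 8.0)     # maximum vs. sqrt(m/n)/8
```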

In this section, we apply this strategy in different learning tasks.
Before giving these lower bounds (Section 8.4.2), Section 8.4.1 stresses
the influence of the loss function in the computations of the edge
discrepancy $d_I$ and the constant $\zeta$ of the differentiability assumption
(8.9).

Figure 4: Comparison of the r.h.s. of (8.22), (8.23) and (8.24)



8.4.1 Edge discrepancy $d_I$ and constant $\zeta$ of the differentiability
assumption

All the previous lower bounds rely on either the edge discrepancy $d_I$
or the constant $\zeta$ of the differentiability assumption (that has been
introduced to control $d_I$ in a simple way). The aim of this section is
to provide more explicit formulas of these quantities for different loss
functions.

To obtain the formula for $d_I$, we essentially use (8.6) and (8.5)
jointly with the explicit computation of the second derivative of the
function $\phi$.

Entropy loss. Here we consider $\mathcal{Y} = [0; 1]$ and the loss for prediction
$y'$ instead of $y$ is $\ell(y, y') = K(y, y')$, where $K(y, y')$ is the Kullback-Leibler
divergence between Bernoulli distributions with respective parameters
$y$ and $y'$, i.e. $K(y,y') = y \log\big(\frac{y}{y'}\big) + (1-y) \log\big(\frac{1-y}{1-y'}\big)$. Let
$H(y)$ denote the Shannon entropy of the Bernoulli distribution with
parameter $y$, i.e.

$$H(y) = -y \log y - (1-y) \log(1-y). \qquad (8.25)$$

Computations lead to: for any $p \in [0; 1]$,

$$\phi_{y_1,y_2}(p) = H(p y_1 + (1-p) y_2) - pH(y_1) - (1-p)H(y_2), \qquad (8.26)$$

hence

$$\phi''_{y_1,y_2}(p) = -\frac{(y_1 - y_2)^2}{[p y_1 + (1-p) y_2]\,[p(1-y_1) + (1-p)(1-y_2)]}.$$

This last equality is useful when one wants to compute $\zeta$ satisfying
(8.9).
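
The second derivative above can be checked against (8.26) by finite differences. The sketch below does so for one illustrative pair $(y_1, y_2)$; the helper names and the chosen values are assumptions. It also records one admissible constant for (8.9): since $\bar y (1 - \bar y) \le 1/4$ for any $\bar y \in [0; 1]$, the choice $\zeta = 4(y_1 - y_2)^2$ works.

```python
import math

def H(y):
    """Shannon entropy (8.25) of a Bernoulli(y) distribution, with 0 log 0 = 0."""
    return -sum(v * math.log(v) for v in (y, 1.0 - y) if v > 0.0)

def phi(p, y1, y2):
    """(8.26): phi(p) = H(p y1 + (1-p) y2) - p H(y1) - (1-p) H(y2)."""
    return H(p * y1 + (1.0 - p) * y2) - p * H(y1) - (1.0 - p) * H(y2)

def phi_second(p, y1, y2):
    """Closed form of the second derivative stated after (8.26)."""
    mean = p * y1 + (1.0 - p) * y2
    return -(y1 - y2) ** 2 / (mean * (1.0 - mean))

def central_second_difference(g, x, h=1e-4):
    return (g(x + h) - 2.0 * g(x) + g(x - h)) / (h * h)

y1, y2 = 0.9, 0.2                      # illustrative outputs (assumptions)
zeta = 4.0 * (y1 - y2) ** 2            # admissible constant in (8.9)
gaps = []
for p in (0.1, 0.5, 0.8):
    approx = central_second_difference(lambda t: phi(t, y1, y2), p)
    gaps.append(abs(approx - phi_second(p, y1, y2)))
    print(p, approx, phi_second(p, y1, y2))
```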

Classification loss. In this setting, we have $|\mathcal{Y}| < +\infty$ and the
loss incurred by predicting $y'$ instead of the true value $y$ is $\ell(y, y') = 1_{y \ne y'}$.
In this learning task, we have $\phi_{y_1,y_2}(p) = [p \wedge (1-p)]\, 1_{y_1 \ne y_2}$ and
$\phi''_{y_1,y_2} = 0$ on $[0; 1] - \{1/2\}$. Then (8.6), (8.5) and Remark 8.2 [p. 28]
lead to

$$d_I = \mathbb{E}_{\mu(\cdot|\mathcal{X}_1)} \Big\{ \big[|p_+ - \tfrac12| \wedge |p_- - \tfrac12|\big]\, 1_{(p_+ - \frac12)(p_- - \frac12) < 0} \Big\}.$$

Binary classification losses (or regression losses when the
output is binary). In this setting, we have $\mathcal{Y} = \mathbb{R} \cup \{-\infty; +\infty\}$,
but we know that $P(Y \in \{-1; +1\}) = 1$. So we are only interested
in hypercubes of distributions satisfying this constraint, i.e. such that
for any $x \in \mathcal{X}$, $h_1(x)$ and $h_2(x)$ belong to $\{-1; +1\}$. In this setting,
a best prediction function $g^*$, i.e. a measurable function from $\mathcal{X}$ to
$\mathcal{Y}$ minimizing $R(g) = \mathbb{E}\,\ell[Y, g(X)]$, is determined by the regression
function:

$$\eta(x) = P(Y = +1 | X = x).$$

Let $\mathrm{sign}(x) = 1_{x \ge 0} - 1_{x < 0}$ be the sign function on $\mathbb{R}$.

• Classification loss. The loss function is $\ell(y, y') = 1_{yy' \le 0}$ and
a best prediction function is $g^*(x) = \mathrm{sign}(\eta(x) - 1/2)$. Without
surprise, we recover the same formulae as for the classification
loss.

• Hinge loss. The loss function is $\ell(y, y') = (1 - yy')_+ = \max\{0; 1 - yy'\}$
and a best prediction function is $g^*(x) = \mathrm{sign}(\eta(x) - 1/2)$.
For any $y_1, y_2 \in \{-1; +1\}$, we have $\phi_{y_1,y_2}(p) = 2[p \wedge (1-p)]\, 1_{y_1 \ne y_2}$
and $\phi''_{y_1,y_2} = 0$ on $[0; 1] - \{1/2\}$. Then (8.6), (8.5) and Remark
8.2 [p. 28] lead to

$$d_I = 2\, \mathbb{E}_{\mu(\cdot|\mathcal{X}_1)} \Big\{ \big[|p_+ - \tfrac12| \wedge |p_- - \tfrac12|\big]\, 1_{(p_+ - \frac12)(p_- - \frac12) < 0} \Big\}.$$

• Exponential loss (or AdaBoost loss). The loss function is
$\ell(y, y') = e^{-yy'}$. For any $y_1 \ne y_2 \in \{-1; +1\}$ and any $p \in [0; 1]$,
the function $f_{p,y_1,y_2}$ is minimized for $y = \frac{y_1}{2} \log\big(\frac{p}{1-p}\big)$, so
a best prediction function is $g^*(x) = \frac12 \log\big(\frac{\eta(x)}{1-\eta(x)}\big)$. We obtain

$$\phi_{y_1,y_2}(p) = 2\sqrt{p(1-p)}\; 1_{y_1 \ne y_2} \quad \text{and} \quad \phi''_{y_1,y_2}(p) = -\frac{1}{2[p(1-p)]^{3/2}}\; 1_{y_1 \ne y_2}.$$

In this setting, to obtain a lower bound of $d_I$, one has typically
to compute $\zeta$ satisfying (8.9) and to use (8.10).

• Logit loss. The loss function is $\ell(y, y') = \log(1 + e^{-yy'})$. For
any $y_1 \ne y_2 \in \{-1; +1\}$ and any $p \in [0; 1]$, the function $f_{p,y_1,y_2}$
is minimized for $y = y_1 \log\big(\frac{p}{1-p}\big)$, so a best prediction function is
$g^*(x) = \log\big(\frac{\eta(x)}{1-\eta(x)}\big)$. We obtain $\phi_{y_1,y_2}(p) = H(p)\, 1_{y_1 \ne y_2}$, where
$H(p)$ denotes the Shannon entropy of the Bernoulli distribution
with parameter $p$ (see (8.25)). We get

$$\phi''_{y_1,y_2}(p) = -\frac{1}{p(1-p)}\; 1_{y_1 \ne y_2}.$$

Once more, to obtain a lower bound of $d_I$, one has typically to
compute $\zeta$ satisfying (8.9) and to use (8.10).
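
The closed forms for the exponential and logit losses can be checked by direct minimization. The sketch below assumes, as in the earlier part of the paper, that $f_{p,y_1,y_2}(y) = p\,\ell(y_1, y) + (1-p)\,\ell(y_2, y)$ and that $\phi_{y_1,y_2}(p)$ is its minimal value; it takes $y_1 = +1$, $y_2 = -1$, and uses a plain ternary search (`argmin_convex`, a hypothetical helper) rather than any particular optimizer.

```python
import math

def argmin_convex(g, lo=-20.0, hi=20.0, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if g(a) < g(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

def exp_loss(y, pred):
    return math.exp(-y * pred)

def logit_loss(y, pred):
    return math.log1p(math.exp(-y * pred))

def H(p):
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

y1, y2 = 1.0, -1.0
checks = []
for p in (0.2, 0.5, 0.7):
    log_odds = math.log(p / (1.0 - p))
    for loss, y_star, phi_closed in (
        (exp_loss, 0.5 * y1 * log_odds, 2.0 * math.sqrt(p * (1.0 - p))),
        (logit_loss, y1 * log_odds, H(p)),
    ):
        risk = lambda y: p * loss(y1, y) + (1.0 - p) * loss(y2, y)
        y_num = argmin_convex(risk)
        # (gap on the minimizer, gap on the minimal value)
        checks.append((abs(y_num - y_star), abs(risk(y_num) - phi_closed)))
print(max(c[0] for c in checks), max(c[1] for c in checks))
```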

$L_q$-loss. We consider $\mathcal{Y} = \mathbb{R}$ and the loss function is $\ell(y, y') = |y - y'|^q$
with $q \ge 1$. The values $q = 1$ and $q = 2$ respectively correspond
to the absolute loss and the least square loss.

• Case $q = 1$: Due to the lack of strong convexity of the loss
function, the absolute loss setting differs completely from what
occurs for $q > 1$ and appears to be similar to the classification and
hinge loss settings. Indeed computations lead to $\phi_{y_1,y_2}(p) = [p \wedge (1-p)]\,|y_2 - y_1|$
and $\phi''_{y_1,y_2} = 0$ on $[0; 1] - \{1/2\}$. Then (8.6)
and (8.5) lead to

$$d_I = \mathbb{E}_{\mu(\cdot|\mathcal{X}_1)} \Big\{ |h_2 - h_1|\, \big[|p_+ - \tfrac12| \wedge |p_- - \tfrac12|\big]\, 1_{(p_+ - \frac12)(p_- - \frac12) < 0} \Big\}. \qquad (8.27)$$

• Case $q > 1$: Tedious computations put in Appendix A lead to:
for any $p \in [0; 1]$,

$$\phi_{y_1,y_2}(p) = p(1-p)\, \frac{|y_2 - y_1|^q}{\big[p^{\frac{1}{q-1}} + (1-p)^{\frac{1}{q-1}}\big]^{q-1}} \qquad (8.28)$$

and

$$\phi''_{y_1,y_2}(p) = -\frac{q}{q-1}\, [p(1-p)]^{\frac{2-q}{q-1}}\, \frac{|y_2 - y_1|^q}{\big[p^{\frac{1}{q-1}} + (1-p)^{\frac{1}{q-1}}\big]^{q+1}}. \qquad (8.29)$$

To obtain a lower bound of $d_I$, as for the entropy loss, one has to
compute $\zeta$ satisfying (8.9) and to use (8.10).

• Special case $q = 2$: For the least square setting, the formulae
simplify into $\phi_{y_1,y_2}(p) = p(1-p)|y_2 - y_1|^2$ and $\phi''_{y_1,y_2}(p) = -2|y_2 - y_1|^2$.
Then the edge discrepancy $d_I$ can be written explicitly as

$$d_I = \frac14\, \mathbb{E}_{\mu(\cdot|\mathcal{X}_1)} \big\{ (p_+ - p_-)^2 (h_2 - h_1)^2 \big\}. \qquad (8.30)$$
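
The formulae for $q > 1$ can be cross-checked numerically. The sketch below assumes the readings $\phi_{y_1,y_2}(p) = p(1-p)|y_2-y_1|^q / [p^{1/(q-1)} + (1-p)^{1/(q-1)}]^{q-1}$ for (8.28) and $\phi''_{y_1,y_2}(p) = -\frac{q}{q-1}[p(1-p)]^{(2-q)/(q-1)}|y_2-y_1|^q / [p^{1/(q-1)} + (1-p)^{1/(q-1)}]^{q+1}$ for (8.29), and that $\phi_{y_1,y_2}(p)$ is the minimum over $y$ of $p|y_1 - y|^q + (1-p)|y_2 - y|^q$. It compares the first with a brute-force minimization and the second with a finite difference; the function names and test values are illustrative.

```python
def phi_closed(p, q, y1, y2):
    """Assumed reading of (8.28)."""
    r = 1.0 / (q - 1.0)
    return p * (1.0 - p) * abs(y2 - y1) ** q / (p ** r + (1.0 - p) ** r) ** (q - 1.0)

def phi_second_closed(p, q, y1, y2):
    """Assumed reading of (8.29): second derivative of phi_closed in p."""
    r = 1.0 / (q - 1.0)
    return (-q * r * (p * (1.0 - p)) ** (r - 1.0) * abs(y2 - y1) ** q
            / (p ** r + (1.0 - p) ** r) ** (q + 1.0))

def phi_brute_force(p, q, y1, y2, grid=20001):
    """Grid minimization of p|y1-y|^q + (1-p)|y2-y|^q over y between y1 and y2."""
    lo, hi = min(y1, y2), max(y1, y2)
    ys = (lo + (hi - lo) * k / (grid - 1) for k in range(grid))
    return min(p * abs(y1 - y) ** q + (1.0 - p) * abs(y2 - y) ** q for y in ys)

def second_difference(g, x, h=1e-4):
    return (g(x + h) - 2.0 * g(x) + g(x - h)) / (h * h)

y1, y2 = -0.5, 1.5
errors = []
for q in (1.5, 2.0, 3.0):
    for p in (0.3, 0.5):
        value_gap = abs(phi_brute_force(p, q, y1, y2) - phi_closed(p, q, y1, y2))
        curvature_gap = abs(second_difference(lambda t: phi_closed(t, q, y1, y2), p)
                            - phi_second_closed(p, q, y1, y2))
        errors.append((q, p, value_gap, curvature_gap))
for row in errors:
    print(row)
```

For $q = 2$ both expressions reduce to the special case stated above, which gives a quick consistency check on the general formulae.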



8.4.2 Various learning lower bounds 

Before giving learning lower bounds matching up to multiplicative
constants the upper bounds developed in the previous sections, we
will start with two standard problems: classification lower bounds for
Vapnik-Cervonenkis classes and uniform universal consistency.



Binary classification. We consider $\mathcal{Y} = \{0; 1\}$ and $\ell(y, y') = 1_{y \ne y'}$.
Since the work of Vapnik and Cervonenkis [55], several lower bounds
have been proposed and the most refined ones are given in [33, Chapter 14].
The following theorem provides a significant improvement of
the constants of these bounds.

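Theorem 8.9. Let $L \in [0; 1/2]$, $n \in \mathbb{N}$ and $\mathcal{G}$ be a set of prediction
functions of VC-dimension $V \ge 2$. Consider the set $\mathcal{P}_L$ of probability
distributions on $\mathcal{X} \times \{0; 1\}$ such that $\inf_{g \in \mathcal{G}} R(g) = L$. For any
estimator $\hat g$:

• when $L = 0$, there exists $P \in \mathcal{P}_0$ for which

$$\mathbb{E}R(\hat g) - \inf_{g \in \mathcal{G}} R(g) \ge \begin{cases} \frac{V-1}{2e(n+1)} & \text{when } n \ge V - 2 \\ \frac12 \big(1 - \frac1V\big)^n & \text{otherwise} \end{cases} \qquad (8.31)$$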

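• when $0 < L \le 1/2$, there exists $P \in \mathcal{P}_L$ for which

$$\mathbb{E}R(\hat g) - \inf_{g \in \mathcal{G}} R(g) \ge \begin{cases} \sqrt{\frac{L(V-1)}{32n}} \vee \frac{2(V-1)}{27n} & \text{when } \frac{(1-2L)^2 n}{V-1} \ge \frac49 \\ \frac{1-2L}{6} & \text{otherwise} \end{cases} \qquad (8.32)$$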

• there exists a probability distribution for which 

ER(g)-MR(g)>^ (8.33) 

Sketch. We have $\phi_{y_1,y_2}(p) = [p \wedge (1-p)]\, 1_{y_1 \ne y_2}$ and, for constant
symmetrical hypercubes, $d_I = \sqrt{d_{II}}/2$. Then (8.31) comes from (8.19) and
the use of a $(V-1, 1/(n+1), 1)$-hypercube and a $(V, 1/V, 1)$-hypercube.

To prove (8.32), from (8.17) and the use of a (V - 1, f^)- 
hypercube, a (V— 1, 9n ^ 1 t 2 L) i > (l~2L) 2 )-hypercube and a (V, 1/V, (1 — 
2L) 2 )-hypercube, we obtain



MX^ll w hen (1 ~ 2L) " > k V (1 ~ 2£) 

32n Wlieil y _ 1 ^ 2 V g£ 

Sfl($)-i|fi*(ff)>< s^^T) when £^f2 > | 
^(l-V^P^) always 

which can be weakened into (8.32). Finally, (8.33) comes from the last
inequality and by choosing $L$ such that $1 - 2L = \frac12 \sqrt{V/n}$. □



No uniform universal consistency for general losses. This
type of result is well known and tells that there is no guarantee of doing
well on finite samples. In the classification setting, when the input space is
infinite, i.e. $|\mathcal{X}| = +\infty$, by using a $(\lfloor n\alpha \rfloor, 1/\lfloor n\alpha \rfloor, 1)$-hypercube with $\alpha$
tending to infinity, one can recover that: for any training sample size
n, "any discrimination rule can have an arbitrarily bad probability of
error for finite sample size" ([32]), precisely:

$$\inf_{\hat g} \sup_{P \in \mathcal{P}} \big\{ P[Y \ne \hat g(X)] - \min_g P[Y \ne g(X)] \big\} = 1/2,$$

where the infimum is taken over all (possibly randomized) classification
rules. For general loss functions, as soon as $|\mathcal{X}| = +\infty$, we can use
$(\lfloor n\alpha \rfloor, 1/\lfloor n\alpha \rfloor, 1)$-hypercubes with $\alpha$ tending to infinity and obtain

$$\inf_{\hat g} \sup_{P \in \mathcal{P}} \big\{ \mathbb{E}R(\hat g) - \inf_{g \in \mathcal{G}} R(g) \big\} \ge \sup_{y_1, y_2 \in \mathcal{Y}} \psi_{1,0,y_1,y_2}(1/2), \qquad (8.34)$$

where $\psi$ is the function defined in (8.4).



Entropy loss setting. We consider $\mathcal{Y} = [0; 1]$ and $\ell(y, y') = K(y, y')$
(see p. 37). We have seen in Section 4 that there exists an
estimator $\hat g$ such that

ER(g) - mm R(g) < ™ (g 35) 

gey 

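The following consequence of (8.19) shows that this result is tight.

Theorem 8.10. For any training set size $n \in \mathbb{N}^*$, positive integer $d$
and input space $\mathcal{X}$ containing at least $\lfloor \log_2(2d) \rfloor$ points, there exists
a set $\mathcal{G}$ of $d$ prediction functions such that: for any estimator $\hat g$ there
exists a probability distribution on the data space $\mathcal{X} \times [0; 1]$ for which

$$\mathbb{E}R(\hat g) - \min_{g \in \mathcal{G}} R(g) \ge e^{-1} (\log 2) \Big(1 \wedge \frac{\lfloor \log_2 |\mathcal{G}| \rfloor}{n+1}\Big).$$

Proof. We use a $\big(\tilde m, \frac{1}{n+1} \wedge \frac{1}{\tilde m}, 1\big)$-hypercube with $\tilde m = \lfloor \log_2 |\mathcal{G}| \rfloor = \lfloor \log_2 d \rfloor$,
$h_1 \equiv 0$ and $h_2 \equiv 1$. From (8.4), (8.6) and (8.26), we have

$$d_I = \psi_{1,0,0,1}(1/2) = \phi_{0,1}(1/2) = H(1/2) = \log 2.$$

From (8.19), we obtain

$$\mathbb{E}R(\hat g) - \min_{g \in \mathcal{G}} R(g) \ge \Big(\frac{\tilde m}{n+1} \wedge 1\Big) (\log 2) \Big(1 - \frac{1}{n+1} \wedge \frac{1}{\tilde m}\Big)^n.$$

Then the result follows from $[1 - 1/(n+1)]^n \searrow e^{-1}$. □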
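
The last step of the proof can be illustrated numerically. The sketch below assumes that the bound from (8.19) reads $m w d_I (1-w)^n$ with $w = \frac{1}{n+1} \wedge \frac{1}{\tilde m}$ and $d_I = \log 2$, and that the theorem's bound reads $e^{-1} (\log 2)(1 \wedge \tilde m/(n+1))$; the function names are hypothetical. It checks that $[1 - 1/(n+1)]^n$ decreases towards $e^{-1}$ and that the first bound dominates the second on a range of $(n, \tilde m)$.

```python
import math

def miss_probability(n):
    """[1 - 1/(n+1)]^n: chance that n i.i.d. inputs all avoid a cell of mass 1/(n+1)."""
    return (1.0 - 1.0 / (n + 1)) ** n

def bound_from_8_19(n, m):
    """m w d_I (1 - w)^n with w = min(1/(n+1), 1/m) and d_I = log 2."""
    w = min(1.0 / (n + 1), 1.0 / m)
    return m * w * math.log(2.0) * (1.0 - w) ** n

def bound_of_theorem_8_10(n, m):
    """e^{-1} (log 2) min(1, m/(n+1)), where m stands for floor(log2 |G|)."""
    return math.exp(-1.0) * math.log(2.0) * min(1.0, m / (n + 1))

values = [miss_probability(n) for n in range(1, 2001)]
print(values[0], values[-1], math.exp(-1.0))
```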

Remark 8.6. For $|\mathcal{G}| < 2^{n+2}$, the lower bound matches the upper bound
(8.35) up to the multiplicative factor $e \approx 2.718$. For $|\mathcal{G}| \ge 2^{n+2}$, the
size of the model is too large and, without any extra assumption, no
estimator can learn from the data. To prove the result, we consider
distributions for which the output is deterministic when knowing the
input. So the lower bound does not come from noisy situations but
from situations in which different prediction functions are not separated
by the data, to the extent that no input data falls into the (small)
subset on which they are different.



$L_q$-regression with bounded outputs. We consider $\mathcal{Y} = [-B; B]$
and $\ell(y, y') = |y - y'|^q$ (see p. 39). The following two theorems are
roughly summed up in Figure 5, which represents the optimal convergence
rate for $L_q$-regression.



Case $1 \le q \le 1 + \sqrt{\frac{\lfloor \log_2 |\mathcal{G}| \rfloor}{4n}} \wedge 1$: From (6.5), there exists an
estimator $\hat g$ such that



ER(g) - min R{g) <2^r 1 B"J (8.36) 

The following corollary of Theorem 8.7 shows that this result is 
tight. 

Theorem 8.11. Let $B > 0$ and $d \in \mathbb{N}^*$. For any training set
size $n \in \mathbb{N}^*$ and any input space $\mathcal{X}$ containing at least $\lfloor \log_2 d \rfloor$
points, there exists a set $\mathcal{G}$ of $d$ prediction functions such that:
for any estimator $\hat g$ there exists a probability distribution on the
data space $\mathcal{X} \times [-B; B]$ for which



ER(g) - mm R(g) > 
where 



f ee | 2c q B q otherwise 




if ?=1 

if !<<?<!+ V /I %F A 1 






Figure 5: Influence of the convexity of the loss on the optimal convergence
rate. Let $c > 0$. We consider $L_q$-losses with $q = 1 + c\big(\frac{\log |\mathcal{G}|}{n}\big)^u$ for $u \ge 0$.
For such values of $q$, the optimal convergence rate of the associated learning
task is of order $\big(\frac{\log |\mathcal{G}|}{n}\big)^v$ with $1/2 \le v \le 1$. This figure represents the value
of $u$ in abscissa and the value of $v$ in ordinate. The value $u = 0$ corresponds
to constant $q$ greater than 1. For these $q$, the optimal convergence rate is
of order $n^{-1}$ while for $q = 1$ or "very close" to 1, the convergence rate is of
order $n^{-1/2}$.



Proof. See Section 10.12. □ 

Case $q > 1 + \sqrt{\frac{\lfloor \log_2 |\mathcal{G}| \rfloor}{4n}} \wedge 1$: We have seen in Section 4 that
there exists an estimator $\hat g$ such that

ER(g) mm i?(.g) < (log 2) IsSlM (8 . 37) 

The following corollary of Theorem 8.7 shows that this result is 
tight. 

Theorem 8.12. Let $B > 0$ and $d \in \mathbb{N}^*$. For any training set
size $n \in \mathbb{N}^*$ and input space $\mathcal{X}$ containing at least $\lfloor \log_2(2d) \rfloor$
points, there exists a set $\mathcal{G}$ of $d$ prediction functions such that:
for any estimator $\hat g$ there exists a probability distribution on the
data space $\mathcal{X} \times [-B; B]$ for which

ER(g) mm R(g) > V e"^^ A l). 
Proof. See Section 10.12. □



Remark 8.7. For least square regression (i.e. $q = 2$), Remark
8.6 [p. 42] holds provided that the multiplicative factor becomes
$2e \log 2 \approx 3.77$. More generally, the method used here gives close
to optimal constants but not the exact ones. We believe that this
limit is due to the use of the hypercube structure. Indeed, the
reader may check that for hypercubes of distributions, the upper
bounds used in this section are not constant-optimal since the
simplifying step consisting in using $\min_{\rho \in \mathcal{M}} \cdots \le \min_{g \in \mathcal{G}} \cdots$ is
loose.

The reader may recover that there are essentially two classes of
bounded losses: the ones which are not convex or not convex enough
(typical examples are the classification loss, the hinge loss and the
absolute loss) and the ones which are sufficiently convex (typical
examples are the least square loss, the entropy loss, the logit loss and the
exponential loss). For the first class of losses, the edge discrepancy of
type I is proportional to $\sqrt{d_{II}}$ for constant and symmetrical hypercubes
and (8.17) leads to a convergence rate of $\sqrt{(\log |\mathcal{G}|)/n}$. For the second
class, the convergence rate is $(\log |\mathcal{G}|)/n$ and the lower bound can be
explained by the fact that, when two prediction functions are different
on a set with low probability (typically $n^{-1}$), it often happens that the
training data has no input points in this set. For such training data,
it is impossible to consistently choose the right prediction function.

This picture of convergence rates for finite models is rather well-known,
since

• similar bounds (with looser constants) were known before for
some cases (e.g. in classification, see [55, 33]).

• mutatis mutandis, the picture exactly matches the picture in the
individual sequence prediction literature: for mixable loss functions
(similar to "sufficiently convex"), the minimax regret is
$O(\log |\mathcal{G}|)/n$, whereas for 0/1-type loss functions, it is $O\big(\sqrt{(\log |\mathcal{G}|)/n}\big)$
(see e.g. [37]).

$L_q$-regression for unbounded outputs having finite moments.

• Case $q = 1$: From (7.1), when $\sup_{g \in \mathcal{G}} \mathbb{E}_Z\, g(X)^2 \le b^2$ for some
$b > 0$, there exists an estimator for which

$$\mathbb{E}R(\hat g) - \min_{g \in \mathcal{G}} R(g) \le 2b\sqrt{(2 \log |\mathcal{G}|)/n}.$$

The following corollary of Theorem 8.7 shows that this result is
tight.

Theorem 8.13. For any training set size $n \in \mathbb{N}^*$, positive integer
$d$, positive real number $b$ and input space $\mathcal{X}$ containing at
least $\lfloor \log_2 d \rfloor$ points, there exists a set $\mathcal{G}$ of $d$ prediction functions
uniformly bounded by $b$ such that: for any estimator $\hat g$ there exists
a probability distribution for which $\mathbb{E}|Y| < +\infty$ and



Ei?(g)-mini?( ff )>WiM^Ai 
Proof. Let $\tilde m = \lfloor \log_2 |\mathcal{G}| \rfloor$. We consider a $\big(\tilde m, 1/\tilde m, \frac{\tilde m}{4n} \wedge 1\big)$-hypercube
with $h_1 \equiv -b$ and $h_2 \equiv b$. One may check that $d_I = b\sqrt{d_{II}}$,
so that (8.17) gives that for any estimator there exists a
probability distribution for which $\mathbb{E}|Y| < +\infty$ and



ER(g) - mm R(g) > A l(l - ^\ A 

hence the desired result. □ 

• Case $q > 1$: First let us recall the upper bound. In Corollary
7.5, under the assumptions

$\sup_{g \in \mathcal{G}, x \in \mathcal{X}} |g(x)| \le b$ for some $b > 0$
$\mathbb{E}|Y|^s \le A$ for some $s \ge q$ and $A > 0$
$\mathcal{G}$ finite

we have proposed an algorithm satisfying

. „, x . / wheng<s<2g- 
" \ ^(IheM) 1 -— when S >2< Z -2 

for a quantity C which depends only on b, A, q and s. 

The following corollary of Theorem 8.7 shows that this result is
tight and is illustrated by Figure 6.

Theorem 8.14. Let $d \in \mathbb{N}^*$, $s \ge q > 1$, $b > 0$ and $A > 0$.
For any training set size $n \in \mathbb{N}^*$ and input space $\mathcal{X}$ containing
at least $\lfloor \log_2(2d) \rfloor$ points, there exists a set $\mathcal{G}$ of $d$ prediction
functions uniformly bounded by $b$ such that: for any estimator $\hat g$
there exists a probability distribution on the data space $\mathcal{X} \times \mathbb{R}$ for
which $\mathbb{E}|Y|^s \le A$ and



ER(g) - min R(g) > 

for a quantity $C$ which depends only on the real numbers $b$, $A$, $q$
and $s$.

Both inequalities simultaneously hold but the first one is tight
for $q \le s \le 2q - 2$ while the second one is tight for $s \ge 2q - 2$.
They are both based on (8.17) applied to $\lfloor \log_2 |\mathcal{G}| \rfloor$-dimensional
hypercubes.

Contrary to other lower bounds obtained in this work, the first
inequality is based on asymmetrical hypercubes. The use of this
kind of hypercubes can be partially explained by the fact that
the learning task is asymmetrical. Indeed, not all values of the
output space have the same status since predictions are
constrained to be in $[-b; b]$ while outputs are allowed to be in the
whole real space (see the constraints on the hypercube in the
proof given in Section 10.13).




Figure 6: Optimal convergence rates in $L_q$-regression when the output has
a finite moment of order $s$ (see Theorem 8.14). The convergence rate is
of order $\big(\frac{\log |\mathcal{G}|}{n}\big)^v$ with $0 < v \le 1$. The figure represents the value of $s$ in
abscissa and the value of $v$ in ordinate. Two cases have to be distinguished.
For $1 < q \le 2$ (figure on the left), $v$ depends smoothly on $q$. For $q > 2$
(figure on the right), two stages are observed depending on whether $s$ is larger
than $2q - 2$.

8.4.3 Numerical comparison of the lower bounds (8.17), 
(8.18) and (8.20) 

To compare (8.17), (8.18) and (8.20), we will compare their asymptotic
versions, i.e. (8.22), (8.23) and (8.24).

Classification in VC classes. We consider $|\mathcal{Y}| = 2$ and $\ell(y, y') = 1_{y \ne y'}$.
Since the differentiability assumption does not hold in this setting,
we only compare (8.17) and (8.18).

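Theorem 8.15. Let $\mathcal{P}$ be the set of all probability distributions on the
data space $\mathcal{Z}$. Let $(\mathcal{G}_n)_{n \in \mathbb{N}}$ be a family of prediction function spaces
of VC-dimension $V_n \ge 2$ satisfying $n/V_n \to +\infty$ as $n \to +\infty$. For any
algorithm $\hat g$:

$$\liminf_{n \to +\infty} \sqrt{\frac{n}{V_n}}\, \sup_{P \in \mathcal{P}} \big\{ \mathbb{E}_{Z_1^n} R(\hat g_{Z_1^n}) - \inf_{g \in \mathcal{G}_n} R(g) \big\} \ge \begin{cases} \alpha_1 & \text{from (8.22)} \\ \alpha_2 & \text{from (8.23)} \end{cases}$$

with

$$\alpha_1 = \max_{a > 0} \frac{\sqrt{a}}{2} \Big(1 - \sqrt{1 - e^{-a}}\Big) \approx 0.135$$

$$\alpha_2 = \max_{a > 0} \sqrt{\frac{a}{2\pi}} \int_{\sqrt{a}}^{+\infty} e^{-t^2/2}\, dt \approx 0.170$$

In particular for a given set $\mathcal{G}$ of finite VC-dimension $V$, for n
sufficiently large, any estimator $\hat g$ satisfies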

sup{Ei?(.g) - inf g6a R(g)} > \^/Vjn~. 
Pev 

Proof. It suffices to apply Corollary 8.8 to $(V_n, 1/V_n, aV_n/n)$-hypercubes,
use that $d_I = \sqrt{d_{II}}/2$ and choose the real number $a > 0$ to maximize
the lower bound. □

The two inequalities, coming from (8.22) and (8.23), simultaneously 
hold. They only differ by a multiplicative constant. 

Least square regression with unbounded outputs satisfying
$\mathbb{E}Y^2 \le A$ for some $A > 0$. We consider the context of
Corollary 7.4 with $s = 2$. The best explicit constant in (7.4) is obtained
from (7.3) for $\lambda = \sqrt{\frac{\log |\mathcal{G}|}{8 b^2 A n}} \wedge (8b^2)^{-1}$. When $\log |\mathcal{G}| \le An/(8b^2)$,
we get

$$\mathbb{E}R(\hat g) - \min_{g \in \mathcal{G}} R(g) \le \sqrt{32}\; b \sqrt{A}\, \sqrt{\frac{\log |\mathcal{G}|}{n}}.$$

Now let us give the associated lower bounds coming from (8.22), (8.23)
and (8.24).

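Theorem 8.16. Let $\mathcal{X}$ be an infinite input space. Let $A > 0$ and $\mathcal{P}$
be the set of probability distributions on $\mathcal{X} \times \mathbb{R}$ such that $\mathbb{E}Y^2 \le A$.
There exists a family of prediction function spaces $(\mathcal{G}_n)_{n \in \mathbb{N}}$ such that
any prediction function in these sets is uniformly bounded by $b$, their
sizes grow at most subexponentially, i.e. $n/\log |\mathcal{G}_n|$ goes to infinity
when n goes to infinity, and for any algorithm $\hat g$

$$\liminf_{n \to +\infty} \sqrt{\frac{n}{\log |\mathcal{G}_n|}}\, \sup_{P \in \mathcal{P}} \big\{ \mathbb{E}_{Z_1^n} R(\hat g_{Z_1^n}) - \min_{g \in \mathcal{G}_n} R(g) \big\} \ge \begin{cases} \beta_1 b \sqrt{A} & \text{from (8.22)} \\ \beta_2 b \sqrt{A} & \text{from (8.23)} \\ \beta_3 b \sqrt{A} & \text{from (8.24)} \end{cases}$$

with

$$\beta_1 = (\log 2)^{-1} \max_{a > 0} \sqrt{a}\, \Big(1 - \sqrt{1 - e^{-a}}\Big) \approx 0.3897$$

$$\beta_2 = (\log 2)^{-1} \max_{a > 0} \sqrt{\frac{2a}{\pi}} \int_{\sqrt{a}}^{+\infty} e^{-t^2/2}\, dt \approx 0.4904$$

$$\beta_3 = (\log 2)^{-1} \max_{a > 0} \sqrt{a}\, \Big(1 + \tfrac12 e^{-a/2} - \tfrac12 e^{3a/2}\Big) \approx 0.5154$$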



The three inequalities, coming from (8.22), (8.23) and (8.24), only 
differ by a multiplicative constant. The one coming from (8.24) gives 
the tightest result. The difference between the upper bound and this 
lower bound is a multiplicative factor smaller than 11. 
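
The numerical constants of Theorems 8.15 and 8.16 can be reproduced with a simple grid search. The sketch below assumes that the right-hand sides of (8.22), (8.23) and (8.24) read $1 - \sqrt{1 - e^{-a}}$, $P(|N| > \sqrt{a})$ and $1 + \frac12 e^{-a/2} - \frac12 e^{3a/2}$ respectively, and that each constant is the maximum over $a > 0$ of $\sqrt{a}$ times the corresponding right-hand side, divided by $\log 2$ (for the three constants of Theorem 8.16) or by 2 (for the two constants of Theorem 8.15). The search interval and grid size are arbitrary choices.

```python
import math

def maximize(g, lo=1e-6, hi=5.0, points=50000):
    """Grid search for the maximum of g over [lo, hi]."""
    step = (hi - lo) / points
    return max(g(lo + k * step) for k in range(points + 1))

def rhs_8_22(a):
    return 1.0 - math.sqrt(1.0 - math.exp(-a))

def rhs_8_23(a):
    return math.erfc(math.sqrt(a / 2.0))      # P(|N| > sqrt(a)) for a standard gaussian N

def rhs_8_24(a):
    return 1.0 + 0.5 * math.exp(-a / 2.0) - 0.5 * math.exp(1.5 * a)

maxima = [maximize(lambda a, r=r: math.sqrt(a) * r(a))
          for r in (rhs_8_22, rhs_8_23, rhs_8_24)]
betas = [v / math.log(2.0) for v in maxima]    # constants of Theorem 8.16
alphas = [v / 2.0 for v in maxima[:2]]         # constants of Theorem 8.15
print(betas)
print(alphas)
```

The three maxima are attained for $a$ between roughly 0.25 and 0.6, which is consistent with taking $n w d_{II}$ of order 1.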

Remark 8.8. In all these examples (i.e. the ones of Section 8.4), we
have only considered constant hypercubes. The use of non-constant
hypercubes can be required when smoothness assumptions are put on
the regression function $\eta : x \mapsto P(Y = 1 | X = x)$. This is typically
the case in works on plug-in classifiers ([2, 9]). For instance, the proof
of [9, Theorems 3.5 and 4.1] relies on non-constant symmetrical
hypercubes for which the function $\xi$ (see Definition 8.2) is chosen such
that it vanishes on the border of the partition cells, which ensures the
regularity of the regression function $\eta$.

Proof of Theorem 8.16. See Section 10.14. □

9 Summary of contributions and open problems

This work has developed minimax optimal risk bounds for the general
learning task consisting in predicting as well as the best function in a
reference set. It has proposed to summarize this learning problem by
the variance function appearing in the variance inequality (p. 7). The
SeqRand algorithm (Figure 1) based on this variance function leads to
minimax optimal convergence rates in the model selection aggregation
problem, and our analysis gives a unified view of results coming
from different communities.

In particular, results coming from the online learning literature are
recovered in Section 4.1. Corollary 4.7 gives a new bound in the online
learning setting (sequential prediction with expert advice). The
generalization error bounds obtained by Juditsky, Rigollet and Tsybakov
in [38] are recovered for a slightly different algorithm in Section 5.

Without any extra assumption on the learning task, we have obtained
a Bernstein-type bound which has no known equivalent form
when the loss function is not assumed to be bounded (Section 6.1.1).
When the loss function is bounded, the use of Hoeffding's inequality
w.r.t. Gibbs distributions on the prediction function space instead of
the distribution generating the data leads to an improvement by a
factor 2 of the standard-style risk bound (Theorem 6.4).

To prove that our bounds are minimax optimal, we have refined
Assouad's lemma, in particular by taking into account the properties
of the loss function. Theorem 8.7 is tighter than previous versions
of Assouad's lemma and easier to apply to a learning setting than
Fano's lemma (see e.g. [51]); besides, the latter leads in general to very
loose constants. It improves the constants of lower bounds related to
Vapnik-Červonenkis classes by a factor greater than 1000. We have
also illustrated our upper and lower bounds by studying the influence
of the noise of the output and of the convexity of the loss function.

For the $L_q$-loss with $q > 1$, new matching upper and lower bounds
are given: in the online learning framework under a boundedness
assumption (Corollary 4.5 and Section 8.4.2 jointly with Remark 8.1), in
the batch learning setting under a boundedness assumption (Section 4.1
and Section 8.4.2), and in the batch learning setting for unbounded
observations under moment assumptions (Sections 7 and 8.4.2). In the latter
setting, we still assume that the prediction functions are bounded.
It is an open problem to replace this boundedness assumption with a
moment condition.

Finally, this work has the following limitations. Most of our results
concern expected risks and it is an open problem to provide corresponding
tight exponential inequalities. Besides, we should emphasize that our
expected risk upper bounds hold only for our algorithm. This is quite
different from the classical point of view that simultaneously gives
upper bounds on the risk of any prediction function in the model. To our
current knowledge, this classical approach has a flexibility that is not
recovered in our approach. For instance, in several learning tasks,
Dudley's chaining trick [35] is the only way to prove risk convergence with
the optimal rate. So a natural question and another open problem is
whether it is possible to combine the better variance control presented
here with the chaining argument (or other localization arguments used
when exponential inequalities are available).

10 Proofs 

10.1 Proof of Theorem 4.4 

First, by a scaling argument, it suffices to prove the result for $a = 0$
and $b = 1$. For $\mathcal{Y} = [0; 1]$, we modify the proof in Appendix A of [39].
Precisely, claims 1 and 2, with the notation used there, become:

1. If the function $f$ is concave in $\alpha([p; q])$ then we have $\Delta_t(q) \le B_t(p)$,
2. If $c \ge R(z, p, q)$ for any $z \in (p; q)$, then the function $f$ is concave
in $\alpha([p; q])$.

Up to the missing $\alpha$ (typo), the difference is that we restrict
ourselves to values of $z$ in $[p; q]$. The proof of Claim 2 has no new
argument. For claim 1, it suffices to modify the definition of $x'_{t,i}$ into
$x'_{t,i} = q \wedge G^{-1}[L(p, x_{t,i})] \in [p; q]$. Then we have $L(p, x'_{t,i}) \le L(p, x_{t,i})$
and $L(q, x'_{t,i}) \le L(p, x_{t,i})$, hence $\alpha(x'_{t,i}) \ge \alpha(x_{t,i})$ and $\gamma(x'_{t,i}) \ge \gamma(x_{t,i})$.
Now one can prove that $f$ is decreasing on $\alpha([p; q])$. By using Jensen's
inequality, we get



- cl °sEr=i z; t,i/[ a K ) i)] 



> 

> 
> 



-clog/ 
-clog/ 



L[q,G-\A t (p))] 



The end of the proof of claim 1 is then identical. 



10.2 Proof of Corollary 5.1 

We start by proving that the variance inequality holds with $\delta_\lambda = 0$, and
that we may take $\pi(\rho)$ to be the Dirac distribution at the function $E_{g \sim \rho}\, g$.
By using Jensen's inequality and Fubini's theorem, Assumption (5.1)
implies that

$$
\begin{aligned}
E_{g' \sim \pi(\rho)} E_{Z \sim P} \log E_{g \sim \rho}\, e^{\lambda[L(Z,g') - L(Z,g)]}
&= E_{Z \sim P} \log E_{g \sim \rho}\, e^{\lambda[L(Z, E_{g' \sim \rho} g') - L(Z,g)]} \\
&\le \log E_{g \sim \rho} E_{Z \sim P}\, e^{\lambda[L(Z, E_{g' \sim \rho} g') - L(Z,g)]} \\
&\le \log E_{g \sim \rho}\, \psi(E_{g' \sim \rho}\, g', g) \\
&\le \log \psi(E_{g' \sim \rho}\, g', E_{g \sim \rho}\, g) \\
&= 0,
\end{aligned}
$$

so that we can apply Theorem 3.1. It remains to note that in this
context the SeqRand algorithm is the one described in the corollary.

10.3 Proof of Theorem 6.1 

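To check that the variance inequality holds, it suffices to prove that
for any $z \in \mathcal{Z}$

$$E_{g' \sim \rho} \log E_{g \sim \rho}\, e^{\lambda[L(z,g') - L(z,g)] - \frac{\lambda^2}{2}[L(z,g') - L(z,g)]^2} \le 0. \qquad (10.1)$$

To shorten formulae, let $\alpha(g', g) = \lambda[L(z, g') - L(z, g)]$. By Jensen's
inequality and the following symmetrization trick, (10.1) holds.

$$
\begin{aligned}
E_{g' \sim \rho} E_{g \sim \rho}\, e^{\alpha(g',g) - \frac{\alpha^2(g',g)}{2}}
&\le \tfrac{1}{2} E_{g' \sim \rho} E_{g \sim \rho}\, e^{\alpha(g',g) - \frac{\alpha^2(g',g)}{2}}
 + \tfrac{1}{2} E_{g' \sim \rho} E_{g \sim \rho}\, e^{-\alpha(g',g) - \frac{\alpha^2(g',g)}{2}} \\
&\le E_{g' \sim \rho} E_{g \sim \rho} \cosh\big(\alpha(g,g')\big)\, e^{-\frac{\alpha^2(g,g')}{2}} \\
&\le 1 \qquad (10.2)
\end{aligned}
$$

where in the last inequality we used the inequality $\cosh(t) \le e^{t^2/2}$ for
any $t \in \mathbb{R}$. The result then follows from Theorem 3.1.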
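
The symmetrization bound (10.2) can be checked numerically on finite sets of prediction functions. The sketch below is only an illustration under assumptions of our own: the loss values $L(z, g)$ are replaced by random numbers, $\rho$ is a random finite distribution, and the helper names are hypothetical. It evaluates $E_{g' \sim \rho} E_{g \sim \rho}\, e^{\alpha - \alpha^2/2}$ and reports the largest value found, which should not exceed 1.

```python
import math
import random

def symmetrized_mean(losses, weights, lam):
    """E_{g'~rho} E_{g~rho} exp(a - a^2/2), with a = lam * (L(z,g') - L(z,g)).

    `losses` plays the role of the values L(z, g) over a finite set of
    functions g, and `weights` is the distribution rho on that set.
    """
    total = 0.0
    for loss_prime, w_prime in zip(losses, weights):
        for loss, w in zip(losses, weights):
            a = lam * (loss_prime - loss)
            total += w_prime * w * math.exp(a - 0.5 * a * a)
    return total

def random_distribution(size, rng):
    """A random probability vector with strictly positive entries."""
    raw = [rng.random() + 1e-3 for _ in range(size)]
    norm = sum(raw)
    return [x / norm for x in raw]

rng = random.Random(0)
worst = 0.0
for _ in range(500):
    size = rng.randint(2, 6)
    losses = [rng.uniform(-5.0, 5.0) for _ in range(size)]
    weights = random_distribution(size, rng)
    lam = rng.uniform(0.01, 3.0)
    worst = max(worst, symmetrized_mean(losses, weights, lam))

print("largest value found:", worst)
```

The bound holds for every $\lambda > 0$ without any boundedness assumption on the losses, which is what makes the resulting Bernstein-type inequality of Theorem 6.1 assumption-free.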



10.4 Proof of Corollary 6.2 

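To shorten the following formula, let $\mu$ denote the law of the prediction
function produced by the SeqRand algorithm (w.r.t. simultaneously
the training set and the randomizing procedure). Then (6.1) can be
written as: for any $\rho \in \mathcal{M}$,

$$E_{g' \sim \mu} R(g') \le E_{g \sim \rho} R(g) + \frac{\lambda}{2} E_{g \sim \rho} E_{g' \sim \mu} V(g, g') + \frac{K(\rho, \pi)}{\lambda(n+1)}. \qquad (10.3)$$

Define $\bar R(g) = R(g) - R(\tilde g)$ for any $g \in \mathcal{G}$. Under the generalized
Mammen and Tsybakov assumption, for any $g, g' \in \mathcal{G}$, we have

$$
\begin{aligned}
\tfrac{1}{2} V(g,g') &\le E_Z \{[L(Z,g) - L(Z,\tilde g)]^2\} + E_Z \{[L(Z,g') - L(Z,\tilde g)]^2\} \\
&\le c \bar R^\gamma(g) + c \bar R^\gamma(g'),
\end{aligned}
$$

so that (10.3) leads to

$$E_{g' \sim \mu} [\bar R(g') - c\lambda \bar R^\gamma(g')] \le E_{g \sim \rho} [\bar R(g) + c\lambda \bar R^\gamma(g)] + \frac{K(\rho, \pi)}{\lambda(n+1)}. \qquad (10.4)$$

This gives the first assertion. For the second statement, let $u = E_{g' \sim \mu} \bar R(g')$
and $\chi(u) = u - c\lambda u^\gamma$. By Jensen's inequality, the l.h.s.
of (10.4) is lower bounded by $\chi(u)$. By straightforward computations,
for any $0 < \beta < 1$, when $u \ge \big(\frac{c\lambda}{1-\beta}\big)^{\frac{1}{1-\gamma}}$, $\chi(u)$ is lower bounded by $\beta u$,
which implies the desired result.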
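
The last step, namely that $\chi(u) = u - c\lambda u^\gamma$ dominates $\beta u$ beyond a threshold, can be checked numerically. The sketch below assumes $0 \le \gamma < 1$ and uses the threshold $u_0 = (c\lambda/(1-\beta))^{1/(1-\gamma)}$; the function names and the parameter grid are illustrative choices, not taken from the source.

```python
def chi(u, c, lam, gamma):
    """chi(u) = u - c * lam * u**gamma."""
    return u - c * lam * u ** gamma

def threshold(c, lam, gamma, beta):
    """Smallest u for which chi(u) >= beta * u (requires 0 <= gamma < 1)."""
    return (c * lam / (1.0 - beta)) ** (1.0 / (1.0 - gamma))

violations = 0
for c in (0.5, 1.0, 4.0):
    for lam in (0.01, 0.1, 1.0):
        for gamma in (0.0, 0.3, 0.7, 0.95):
            for beta in (0.1, 0.5, 0.9):
                u0 = threshold(c, lam, gamma, beta)
                for factor in (1.0, 1.5, 10.0, 1000.0):
                    u = factor * u0
                    # Relative tolerance guards against rounding at u = u0,
                    # where chi(u0) equals beta * u0 exactly in theory.
                    if chi(u, c, lam, gamma) < beta * u - 1e-9 * abs(u):
                        violations += 1

print("violations:", violations)
```

The condition is equivalent to $u^{1-\gamma} \ge c\lambda/(1-\beta)$, so equality holds exactly at the threshold and the inequality is strict beyond it.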



10.5 Proof of Theorem 6.3 

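Let us prove (6.3). Let $r(g)$ denote the empirical risk of $g \in \mathcal{G}$, that
is $r(g) = \Sigma_n(g)/n$. Let $\rho \in \mathcal{M}$ be some fixed distribution on $\mathcal{G}$. From
[5, Section 8.1], with probability at least $1 - \epsilon$ w.r.t. the training set
distribution, for any $\mu \in \mathcal{M}$, we have

$$
\begin{aligned}
& E_{g' \sim \mu} R(g') - E_{g \sim \rho} R(g) \\
& \quad \le E_{g' \sim \mu} r(g') - E_{g \sim \rho} r(g) + \lambda \varphi(\lambda B) E_{g \sim \rho} E_{g' \sim \mu} V(g,g') + \frac{K(\mu, \pi) + \log(\epsilon^{-1})}{\lambda n}.
\end{aligned}
$$

Since the Gibbs distribution $\pi_{-\lambda \Sigma_n}$ minimizes $\mu \mapsto E_{g' \sim \mu} r(g') + \frac{K(\mu, \pi)}{\lambda n}$, we have

$$E_{g' \sim \pi_{-\lambda \Sigma_n}} R(g') \le E_{g \sim \rho} R(g) + \lambda \varphi(\lambda B) E_{g' \sim \pi_{-\lambda \Sigma_n}} E_{g \sim \rho} V(g,g') + \frac{K(\rho, \pi) + \log(\epsilon^{-1})}{\lambda n}.$$

Then we apply the following inequality

$$E W \le E(W \vee 0) = \int_0^{+\infty} P(W > u)\, du = \int_0^1 \epsilon^{-1} P\big(W > \log(\epsilon^{-1})\big)\, d\epsilon$$

to the random variable

$$W = \lambda n \big[ E_{g' \sim \pi_{-\lambda \Sigma_n}} R(g') - E_{g \sim \rho} R(g) - \lambda \varphi(\lambda B) E_{g' \sim \pi_{-\lambda \Sigma_n}} E_{g \sim \rho} V(g,g') \big] - K(\rho, \pi).$$

We get $E W \le 1$. At last we may choose the distribution $\rho$ minimizing
the upper bound to obtain (6.3). Similarly using [5, Section 8.3], we
may prove (6.2).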

10.6 Proof of Lemma 6.5 

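It suffices to apply the following adaptation of Lemma 5 of [60] to

$$\xi_i(Z_1, \dots, Z_i) = L[Z_i, \mathcal{A}(Z_1^{i-1})] - L(Z_i, \tilde g).$$

Lemma 10.1. Let $\varphi$ still denote the positive convex increasing function
defined as $\varphi(t) = \frac{e^t - 1 - t}{t^2}$. Let $b$ be a real number. For $i = 1, \dots, n+1$,
let $\xi_i : \mathcal{Z}^i \to \mathbb{R}$ be a function uniformly upper bounded
by $b$. For any $\eta > 0$, $\epsilon > 0$, with probability at least $1 - \epsilon$ w.r.t. the
distribution of $Z_1, \dots, Z_{n+1}$, we have

$$
\sum_{i=1}^{n+1} \xi_i(Z_1, \dots, Z_i) \le \sum_{i=1}^{n+1} E_{Z_i} \xi_i(Z_1, \dots, Z_i)
+ \eta \varphi(\eta b) \sum_{i=1}^{n+1} E_{Z_i} \xi_i^2(Z_1, \dots, Z_i) + \frac{\log(\epsilon^{-1})}{\eta}, \qquad (10.5)
$$

where $E_{Z_i}$ denotes the expectation w.r.t. the distribution of $Z_i$ only.

Remark 10.1. The same type of bounds without variance control can
be found in [26].

Proof. For any $i \in \{0, \dots, n+1\}$, define

$$\psi_i = \sum_{j=1}^{i} \xi_j - \sum_{j=1}^{i} E_{Z_j} \xi_j - \eta \varphi(\eta b) \sum_{j=1}^{i} E_{Z_j} \xi_j^2,$$

where $\xi_j$ is the short version of $\xi_j(Z_1, \dots, Z_j)$. For any $i \in \{0, \dots, n\}$,
we trivially have

$$\psi_{i+1} - \psi_i = \xi_{i+1} - E_{Z_{i+1}} \xi_{i+1} - \eta \varphi(\eta b) E_{Z_{i+1}} \xi_{i+1}^2. \qquad (10.6)$$

Now for any $b \in \mathbb{R}$, $\eta > 0$ and any random variable $W$ such that $W \le b$
a.s., we have

$$E e^{\eta(W - EW - \eta \varphi(\eta b) E W^2)} \le 1. \qquad (10.7)$$

Remark 10.2. The proof of (10.7) is standard and can be found e.g. in
[4, Section 7.1.1]. We use (10.7) instead of the inequality used to prove
Lemma 5 of [60], i.e. $E e^{\eta[W - EW - \eta \varphi(\eta b') E(W - EW)^2]} \le 1$ for $W - EW \le b'$,
since we are interested in excess risk bounds. Precisely, we will take
$W$ of the form $W = L(Z,g) - L(Z,g')$ for fixed functions $g$ and $g'$.
Then we have $W \le \sup_{z,g} L - \inf_{z,g} L$ while we only have
$W - EW \le 2(\sup_{z,g} L - \inf_{z,g} L)$. Besides, the gain of having $E(W - EW)^2$ instead
of $E W^2$ is useless in the applications we develop here.

By combining (10.7) and (10.6), we obtain

$$E_{Z_{i+1}} e^{\eta(\psi_{i+1} - \psi_i)} \le 1. \qquad (10.8)$$

By using Markov's inequality, we upper bound the following probability
w.r.t. the distribution of $Z_1, \dots, Z_{n+1}$:

$$
\begin{aligned}
& P\Big( \sum_{i=1}^{n+1} \xi_i > \sum_{i=1}^{n+1} E_{Z_i} \xi_i + \eta \varphi(\eta b) \sum_{i=1}^{n+1} E_{Z_i} \xi_i^2 + \frac{\log(\epsilon^{-1})}{\eta} \Big) \\
& \quad = P\big( \eta \psi_{n+1} > \log(\epsilon^{-1}) \big) \\
& \quad = P\big( \epsilon e^{\eta \psi_{n+1}} > 1 \big) \\
& \quad \le \epsilon E e^{\eta \psi_{n+1}} \\
& \quad \le \epsilon E_{Z_1} \Big( e^{\eta(\psi_1 - \psi_0)} E_{Z_2} \big( \cdots e^{\eta(\psi_n - \psi_{n-1})} E_{Z_{n+1}} e^{\eta(\psi_{n+1} - \psi_n)} \big) \Big) \\
& \quad \le \epsilon
\end{aligned}
$$

where the last inequality follows from recursive use of (10.8). □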
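
Inequality (10.7) is the one-step building block of the proof, and it can be checked numerically on small discrete distributions. The sketch below assumes $\varphi(t) = (e^t - 1 - t)/t^2$ (extended by $\varphi(0) = 1/2$) and uses randomly generated finite-support variables bounded above by $b$; the helper names are hypothetical.

```python
import math
import random

def phi(t):
    """phi(t) = (exp(t) - 1 - t) / t^2, with phi(0) = 1/2 by continuity."""
    if abs(t) < 1e-4:
        # Taylor expansion avoids cancellation for tiny arguments.
        return 0.5 + t / 6.0 + t * t / 24.0
    return (math.expm1(t) - t) / (t * t)

def exponential_moment(values, probs, eta, b):
    """E exp(eta * (W - E W - eta * phi(eta * b) * E W^2)) for a discrete W <= b."""
    mean = sum(p * v for v, p in zip(values, probs))
    second_moment = sum(p * v * v for v, p in zip(values, probs))
    shift = mean + eta * phi(eta * b) * second_moment
    return sum(p * math.exp(eta * (v - shift)) for v, p in zip(values, probs))

rng = random.Random(0)
largest = 0.0
for _ in range(1000):
    size = rng.randint(2, 5)
    values = [rng.uniform(-3.0, 2.0) for _ in range(size)]
    raw = [rng.random() + 1e-3 for _ in range(size)]
    norm = sum(raw)
    probs = [x / norm for x in raw]
    eta = rng.uniform(0.01, 2.0)
    b = max(values)  # any upper bound of W is admissible since phi is increasing
    largest = max(largest, exponential_moment(values, probs, eta, b))

print("largest exponential moment:", largest)
```

The underlying argument is that $e^{\eta W} \le 1 + \eta W + \eta^2 W^2 \varphi(\eta b)$ whenever $W \le b$, followed by $1 + x \le e^x$.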



10.7 Proof of Theorem 7.2 

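The first inequality follows from Jensen's inequality. Let us prove the
second. According to Theorem 3.1, it suffices to check that the variance
inequality holds for $0 < \lambda \le \lambda_0$, $\pi(\rho)$ the Dirac distribution at $E_{g \sim \rho}\, g$
and

$$
\begin{aligned}
\delta_\lambda[(x,y), g, g'] = \delta_\lambda(y)
&\triangleq \min_{0 \le \zeta \le 1} \Big[ \zeta \Delta(y) + \frac{(1-\zeta)^2 \lambda \Delta^2(y)}{2} \Big] 1_{|y| > B} \\
&= \frac{\lambda \Delta^2(y)}{2}\, 1_{\lambda \Delta(y) < 1;\, |y| > B} + \Big[ \Delta(y) - \frac{1}{2\lambda} \Big] 1_{\lambda \Delta(y) \ge 1;\, |y| > B}.
\end{aligned}
$$

• For any $z = (x,y) \in \mathcal{Z}$ such that $|y| \le B$, for any probability
distribution $\rho$ and for the above values of $\lambda$ and $\delta_\lambda$, by Jensen's
inequality, we have

$$
E_{g \sim \rho}\, e^{\lambda[L(z, E_{g' \sim \rho} g') - L(z,g) - \delta_\lambda(z,g,g')]}
\le \left( \frac{E_{g \sim \rho}\, e^{-\lambda_0 \ell[y, g(x)]}}{e^{-\lambda_0 \ell[y, E_{g' \sim \rho} g'(x)]}} \right)^{\lambda/\lambda_0}
\le 1,
$$

where the last inequality comes from the concavity of $y' \mapsto e^{-\lambda_0 \ell(y, y')}$.
This concavity argument goes back to [40, Section 4], and was
also used in [21] and in some of the examples given in [38].

• For any $z = (x,y) \in \mathcal{Z}$ such that $|y| > B$, for any $0 \le \zeta \le 1$,
by using twice Jensen's inequality and then by using the
symmetrization trick presented in Section 6, we have

$$
\begin{aligned}
& E_{g \sim \rho}\, e^{\lambda[L(z, E_{g' \sim \rho} g') - L(z,g) - \delta_\lambda(z,g,g')]} \\
& \quad = e^{-\lambda \delta_\lambda(y)} E_{g \sim \rho}\, e^{\lambda[L(z, E_{g' \sim \rho} g') - L(z,g)]} \\
& \quad \le e^{-\lambda \delta_\lambda(y)} E_{g \sim \rho}\, e^{\lambda[E_{g' \sim \rho} L(z,g') - L(z,g)]} \\
& \quad \le e^{-\lambda \delta_\lambda(y)} E_{g \sim \rho} E_{g' \sim \rho}\, e^{\lambda[L(z,g') - L(z,g)]} \\
& \quad = e^{-\lambda \delta_\lambda(y)} E_{g \sim \rho} E_{g' \sim \rho} \Big\{ e^{\lambda(1-\zeta)[L(z,g') - L(z,g)] - \frac{1}{2}\lambda^2(1-\zeta)^2[L(z,g') - L(z,g)]^2} \\
& \qquad\qquad \times e^{\lambda \zeta [L(z,g') - L(z,g)] + \frac{1}{2}\lambda^2(1-\zeta)^2[L(z,g') - L(z,g)]^2} \Big\} \\
& \quad \le e^{-\lambda \delta_\lambda(y)} E_{g \sim \rho} E_{g' \sim \rho} \Big\{ e^{\lambda(1-\zeta)[L(z,g') - L(z,g)] - \frac{1}{2}\lambda^2(1-\zeta)^2[L(z,g') - L(z,g)]^2} \Big\} \\
& \qquad\qquad \times e^{\lambda \zeta \Delta(y) + \frac{1}{2}\lambda^2(1-\zeta)^2 \Delta^2(y)} \\
& \quad \le e^{-\lambda \delta_\lambda(y)}\, e^{\lambda \zeta \Delta(y) + \frac{1}{2}\lambda^2(1-\zeta)^2 \Delta^2(y)}.
\end{aligned}
$$

Taking $\zeta \in [0; 1]$ minimizing the last r.h.s., we obtain that

$$E_{g \sim \rho}\, e^{\lambda[L(z, E_{g' \sim \rho} g') - L(z,g) - \delta_\lambda(z,g,g')]} \le 1.$$

From the two previous computations, we obtain that for any $z \in \mathcal{Z}$,

$$\log E_{g \sim \rho}\, e^{\lambda[L(z, E_{g' \sim \rho} g') - L(z,g) - \delta_\lambda(z,g,g')]} \le 0,$$

so that the variance inequality holds for the above values of $\lambda$, $\pi(\rho)$
and $\delta_\lambda$, and the result follows from Theorem 3.1.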
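
The two expressions given for $\delta_\lambda(y)$ when $|y| > B$ (a minimum over $\zeta \in [0;1]$ and a two-case closed form) can be compared numerically. The sketch below is an illustration under our own assumptions: `big_delta` stands for $\Delta(y)$, the grid search is a crude stand-in for the exact minimization, and all names are hypothetical.

```python
def delta_closed_form(lam, big_delta):
    """Closed form of min over zeta in [0,1] of zeta*D + (1-zeta)^2 * lam * D^2 / 2."""
    if lam * big_delta < 1.0:
        return 0.5 * lam * big_delta ** 2   # minimum attained at zeta = 0
    return big_delta - 1.0 / (2.0 * lam)    # attained at zeta = 1 - 1/(lam*D)

def delta_grid(lam, big_delta, steps=20000):
    """Same minimum, approximated on a uniform grid of zeta values."""
    best = float("inf")
    for i in range(steps + 1):
        zeta = i / steps
        value = zeta * big_delta + 0.5 * (1.0 - zeta) ** 2 * lam * big_delta ** 2
        best = min(best, value)
    return best

max_gap = 0.0
for lam in (0.05, 0.5, 1.0, 5.0):
    for big_delta in (0.1, 1.0, 3.0, 10.0):
        gap = abs(delta_grid(lam, big_delta) - delta_closed_form(lam, big_delta))
        max_gap = max(max_gap, gap)

print("largest gap between grid search and closed form:", max_gap)
```

The unconstrained minimizer is $\zeta = 1 - 1/(\lambda \Delta(y))$, which lies in $[0;1]$ exactly when $\lambda \Delta(y) \ge 1$; otherwise the minimum is attained at $\zeta = 0$.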

10.8 Proof of Corollary 7.5 

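To apply Theorem 7.2, we will first determine $\lambda_0$ for which the function
$\zeta : y' \mapsto e^{-\lambda_0 |y - y'|^q}$ is concave. For any given $y \in [-B; B]$, for any
$q > 1$, straightforward computations give

$$\zeta''(y') = \big[ \lambda_0 q |y' - y|^q - (q-1) \big] \lambda_0 q |y' - y|^{q-2} e^{-\lambda_0 |y' - y|^q}$$

for $y' \ne y$, hence $\zeta'' \le 0$ on $[-b; b] - \{y\}$ for $\lambda_0 = \frac{q-1}{q(B+b)^q}$. Now
since the derivative $\zeta'$ is defined at the point $y$, we conclude that the
function $\zeta$ is concave on $[-b; b]$, so that we may use Theorem 7.2 with

$$\lambda_0 = \frac{q-1}{q(B+b)^q}.$$

Contrary to the least square setting, we do not have a simple closed
formula for $\Delta(y)$, but for any $|y| \ge b$, we have

$$2bq(|y| - b)^{q-1} \le \Delta(y) \le 2bq(|y| + b)^{q-1}.$$

As a consequence, when $|y| \ge b + (2bq\lambda)^{-1/(q-1)}$, we have $\lambda \Delta(y) \ge 1$
and $\Delta(y) - 1/(2\lambda)$ can be upper bounded by $C'|y|^{q-1}$, where the
quantity $C'$ depends only on $b$ and $q$.

For other values of $|y|$, i.e. when $b \le |y| < b + (2bq\lambda)^{-1/(q-1)}$, we
have

$$
\begin{aligned}
& \frac{\lambda \Delta^2(y)}{2}\, 1_{\lambda \Delta(y) < 1;\, |y| > B} + \Big[ \Delta(y) - \frac{1}{2\lambda} \Big] 1_{\lambda \Delta(y) \ge 1;\, |y| > B} \\
& \quad = \min_{0 \le \zeta \le 1} \Big[ \zeta \Delta(y) + \frac{(1-\zeta)^2 \lambda \Delta^2(y)}{2} \Big] 1_{|y| > B} \\
& \quad \le \tfrac{1}{2} \lambda \Delta^2(y)\, 1_{|y| > B} \\
& \quad \le 2 \lambda b^2 q^2 (|y| + b)^{2q-2}\, 1_{|y| > B} \\
& \quad \le C'' \lambda |y|^{2q-2}\, 1_{|y| > B},
\end{aligned}
$$

where $C''$ depends only on $b$ and $q$.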

Therefore, from (7.2), for any < b < B and A > satisfying 
< ' ^ ne ex P ec ted risk is upper bounded by 



min \ Eg^ p 



+ X(n+1) } + EjC'lFl 9 1 l|y|> 6+ (2( )g A)-i/(<!-i);|y|>B} 

+E{C"A|y| 2<?_2 l B< |y| <b+(2&(?A:) -i/(< I -i) }. 

(10.9) 

Let us take B = (^x-) 1 ^ 9 ^ b with A small enough to ensure that 
b < B < b + {2bq\)- l / {q - 1 \ This means that A should be taken 
smaller than some positive constant depending only on b and q. Then 
(10.9) can be written as 



mm 

p£M 



{E gr ^ p R(g) + + E{C tf |y|«- 1 l| y |> 6+(a6gA) - 1/{ ,- 1 , } 

+E{C"A|F| 2<? 2 l(9 5 _i)i/<,_; )< |y| <b+ (26gA)- 1 /(9 



Now using (7.5), we can upper bound (10.9) with 

min R(g) + ^2gM + CA^" + CA( A 5 l s > 2<3 _ 2 + A^^l s<29 _ 2 
gee V 

where C depends only on b, A, q and s. So we get 

min 

gee 

min 

gee 



< mm R(g) + + CA^ + CA^ l s > 2? _ 2 

< min iZ(g) + + <7A^l s<2g _ 2 + C\^\ s > 2q - 



since s ^~ 9 > s ~^ +2 is equivalent to s > 2q — 2. By taking A of order 
of the minimum of the r.h.s. (which implies that A goes to when 
n/log|(?| goes to infinity), we obtain the desired result. 
10.9 Proof of Lemma 8.2



Let A* = f r {U) - fifa). We have 

f(a) - af(l) - (1 - a)/(0) 

= /(a)-/(0)-a[/(l)-/(0)] 
r 1 

f'(u)du — a f'(u)du 



(f u f"(t)dt+ J2 A *) du 

-a [ ( f f"(t)dt+ A ?) du 
(lo<t<«<a - al 0<t<u< i) f" (t)dtdu 

[0;i] 2 

+ [(a-U)l u<a -a(l-ti)]At 

fct/£(0;l) 

= -/ K a {t).f"{t)dt- K a (te)A e 

£:t £ 6]0;l[ 

10.10 Proof of Theorem 8.6 

The symbols 01, ... , a m still denote the coordinates of a £ { — ; +} m . 
For any r € {-; 0; +}, define Oj> = {ax,..., <Tj-i,r, crj+i, a m ) as 
the vector deduced from a by fixing its j-th coordinate to r. Since CTj _j. 
and (jj- belong to { — ; +}"\ we have already defined P» j + and P&^_ ■ 
Now we define the distribution P Sj a as P Sj „ (dX) = n(dX) and 

l-Pa ji0 (Y = h 2 (X)\X) 

= P 9 . (Y = h 1 (X)\X) = i l { Z an \ X ,^ v. 

3 - oV v 71 ; \ P ff (r = /ii(X)|X) otherwise 

The distribution P s . differs from P s only by the conditional law of the 
output knowing that the input is in Xj. We recall that P® n denotes 
the n-fold product of a distribution P. For any r £ { — ;+}, introduce 
the likelihood ratios for the data Z™ = (Z\, . . . , Z n ): 

Note that this quantity is independent of the value of a. In the follow- 
ing, to shorten the notation, we will sometimes use hi for h\(X), ft, 2 
for /i2(X), p + for p + (X), p_ for p_(X). Let v be the uniform distri- 
bution on { — ,+}, i.e. z^({+}) = 1/2 = 1 — v({— }). In the following, 






E ff denotes the expectation when a is drawn according to the m-fold 
product distribution of v, and Ejc = Ex~/j- We have 



p^{^V^ )_min9jR(5) } 



> sup {e z „^ p «. E z ^p, £[Y, g(X)} - mm g E Z ^ P& 1% g(X)} } 
ffe{-;+}'" L 1 ' 

= sup < E z „ 
ffe{-;+}'" L 1 



p®n Ex~P s (dX) 



E Y ^ (dYlx) £[Y,g(X)} 



sup < E z „ 

?£{-; + }'" L 1 



~fiy E Y~ Pi ,(dY\X) t(Y, y)] } 



. » lax 



> E ff E z „^ p «„ E x 



X (ip Paj ,hi,h 2 [g{X)} - <t>h u h 2 [Pa 3 ] ) 



1 "i,o 



-(Zf) 



J I 1 



(10.10) 

The two inequalities in (10.10) are Assouad's argument ([3]). For any 
x € X, introduce 



i>x(u) = \{u+l)^ p+{x ),p„ {x)M{x)M{x )(-^ l ). 



Introduce 
The last expectation in (10.10) is

E ff ~„ n aj (Z{ l )(^p pAX)M{x)M(x) [g(X)} - 4> hl {X)M(x)[P<y{X)] 

x[ aj {Z^ p+MM [g{X)\ + [l- aj (Z?)}<p p _, huh MX)} 
-a j (Z?)<f> huha (p+) - [1 - a^Zn^MiP-)} 

= hi^+di 2 ") +^-di z i)] [ l Pa j (z?yp + +[i- aj (z?)]p-,h 1 ,h 2 [9( x )} 
-a 3 (Z^)<f) hlM (p + ) - [1 - aj(^")]^hi,ft 2 (p-)} 

> | [7r +J (ZT) + tt-j^)] {^ Ilh2 (a i (^T)p + + [1 - ai(^)]P- 
-ay(2?)0 /lll h a (p + ) - [1 - aj (Z?)]<j> huh2 (j>_)} 

= $[* + AZ?) + K-j(Z?)]l, p+iP _, huha [ aj (Z?)] 

= 7r _ A Zf)^ x (^ff j ) 

(10.11) 

so that 

sup <^ E R(g) — min s P(g) f 
pg-p lz*»~p®" J 

= E™ i Ex { 1* 6 *, E*S^ , ) } . 

Now since we consider a hypercube, for any j <= {1, ... , to}, all the 
terms in the sum are equal. Besides from part 2 of Lemma 8.5, the 
last /-similarity does not depend on a, and in particular for j = 1, the 
/-similarity is equal to S^ x (Plm\ P® ™) , where we recall that P[ + ] and 
Pr_i denote the representatives of the hypercube (see Definition 8.2). 
Therefore we obtain 



sup {ER (g) - min g R(g)} > mE x [ l X& x^ x (P^ 1 , Pffi) } 



- ^V[+]^[-]> 
where the second to last equality comes from the second part of Lemma 8 

10.11 Proof of Theorem 8.7 

First, when the hypercube satisfies $p_+ \equiv 1 \equiv 1 - p_-$, from the definition
of $d_{\mathrm{I}}$ given in (8.6), we have $S_\psi(P_{[+]}^{\otimes n}, P_{[-]}^{\otimes n}) = m w d_{\mathrm{I}} (1 - w)^n$ so that
Theorem 8.6 implies (8.19).

Inequalities (8.17), (8.18) and (8.20) are deduced from Theorem 8.6
by lower bounding the $\psi$-similarity in different ways.

Since $u \mapsto u \wedge 1$ and $u \mapsto \frac{u}{u+1}$ are non-negative concave functions
defined on $\mathbb{R}_+$, we may define the similarities

$$
\begin{aligned}
S_\wedge(P, Q) &\triangleq \int \Big( \frac{dP}{dQ} \wedge 1 \Big)\, dQ = \int dP \wedge dQ, \\
S_\bullet(P, Q) &\triangleq \int \frac{dP/dQ}{dP/dQ + 1}\, dQ = \int \frac{dP\, dQ}{dP + dQ},
\end{aligned}
$$

where the second equality of both formulas introduces a formal (but
intuitive) notation.

From Theorem 8.6, Lemma 8.3 and item 1 of Lemma 8.5, by using
$\psi(1) = m w d_{\mathrm{I}}$, we obtain

Corollary 10.2. Let $\mathcal{P}$ be a set of probability distributions containing
a hypercube of distributions of characteristic function $\psi$ and
representatives $P_{[-]}$ and $P_{[+]}$. For any estimator $\hat g$, we have

$$\sup_{P \in \mathcal{P}} \Big\{ E R(\hat g) - \min_g R(g) \Big\} \ge m w d_{\mathrm{I}}\, S_\wedge(P_{[+]}^{\otimes n}, P_{[-]}^{\otimes n}) \qquad (10.12)$$

where the minimum is taken over the space of prediction functions. Be- 
sides if for any x £ X\ the function 4>h 1 (x).ho(x) is twice differ entiable 
and satisfies for any t E \p-(x)Ap+(x);p-(x)Vp+(x)], -flhitoMtefo) - 
£ for some C > 0, then we have 

sup{EP(g)-minP( g )} > ^d[ S.(P®» P®»); (10 . 13) 

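The following lemma and (10.12) imply (8.17) and (8.18).

Lemma 10.3. We have

$$S_\wedge(P_{[+]}^{\otimes n}, P_{[-]}^{\otimes n}) \ge 1 - \sqrt{1 - [1 - d_{\mathrm{II}}]^{nw}} \ge 1 - \sqrt{n w d_{\mathrm{II}}}. \qquad (10.14)$$

When the hypercube is symmetrical and constant, for $N$ a centered
gaussian random variable with variance 1, we have

$$S_\wedge(P_{[+]}^{\otimes n}, P_{[-]}^{\otimes n}) \ge P\Big( |N| > \sqrt{\frac{n w d_{\mathrm{II}}}{1 - d_{\mathrm{II}}}} \Big) - d_{\mathrm{II}}^{1/4}. \qquad (10.15)$$

Proof. See Section 10.11.1. □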

Remark 10.3. It is interesting to note that (10.15) is asymptotically
optimal to the extent that for a $(m, w, d_{\mathrm{II}})$-hypercube (see Definition
8.3), we have

5 A (i^?,P®?)-P(|JV-|>-v«5) 0, (10.16) 

[Proof in Appendix C] 

The following lemma and (10.13) imply (8.20). 

Lemma 10.4. When the hypercube is symmetrical and constant, we 
have 



l+dl 



S. (iffi, P^) > i{l + Mi - (i - VT^)«] " - Mi + (vi 

Proof. See Section 10.11.2. □



1 
10.11.1 Proof of Lemma 10.3

For $\sigma \in \{-, +\}$, the conditional law of $(X, Y)$ knowing $X \in \mathcal{X}_1$, when
$(X, Y)$ follows the law $P_{[\sigma]}$, is denoted $P_{\mathcal{X}_1, \sigma}$ and is called the restricted
representative of the hypercube. More explicitly, the probability
distribution $P_{\mathcal{X}_1, \sigma}$ is such that its first marginal $P_{\mathcal{X}_1, \sigma}(dX)$ is $\mu(\cdot | \mathcal{X}_1)$ and
for any $x \in \mathcal{X}_1$

$$P_{\mathcal{X}_1, \sigma}(Y = h_1(x) | X = x) = p_\sigma(x) = 1 - P_{\mathcal{X}_1, \sigma}(Y = h_2(x) | X = x).$$

The following lemma relates the similarity between representatives
of the hypercube and the similarity between restricted representatives.

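Lemma 10.5. Consider a convex function $\gamma : \mathbb{R}_+ \to \mathbb{R}_+$ such that

$$S_\wedge(P_{\mathcal{X}_1,+}^{\otimes k}, P_{\mathcal{X}_1,-}^{\otimes k}) \ge \gamma(k)$$

for any $k \in \{0, \dots, n\}$, where by convention $S_\wedge(P_{\mathcal{X}_1,+}^{\otimes 0}, P_{\mathcal{X}_1,-}^{\otimes 0}) = 1$.
For any estimator $\hat g$, we have

$$S_\wedge(P_{[+]}^{\otimes n}, P_{[-]}^{\otimes n}) \ge \gamma(nw).$$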

Proof. For any points $z_1 = (x_1, y_1), \dots, z_n = (x_n, y_n)$ in $\mathcal{X} \times \{h_1, h_2\}$,
let $C(z_1, \dots, z_n)$ denote the number of $z_i$ for which $x_i \in \mathcal{X}_1$. For any
$k \in \{0, \dots, n\}$, let $B_k = C^{-1}(\{k\})$ denote the subset of $(\mathcal{X} \times \{h_1, h_2\})^n$
for which exactly $k$ points are in $\mathcal{X}_1 \times \{h_1, h_2\}$. We recall that there
are $\binom{n}{k}$ possibilities of taking $k$ elements among $n$ and the probability
of $X \in \mathcal{X}_1$ when $X$ is drawn according to $\mu$ is $w = \mu(\mathcal{X}_1)$. Let
$\mathcal{Z}_1 = \mathcal{X}_1 \times \{h_1, h_2\}$ and let $\mathcal{Z}_1^c$ denote the complement of $\mathcal{Z}_1$. We have

/ pg>" s 

= JlA^^.-A))^!,..,^) 

= ELo /J, 1 A • ■ ■ ^}(^))dP H (*l) ■ • • dP H (^n) 

= ELo (2) Wx(* f )«- 1 A (W^(^) ■ ■ ■ ^(Zn)W^l, ■ ■ ■ , Zn) 

= ELo G) W*w)-* 1 A few ■ • • ftr^)) dP H ■ • • - *0 

= ELo O^f) 1 A (5±l( 2l ) • • • 5±l(z fc ))dP^](.i, ...,**) 

= ELo (D^-Hz^Hz^AiP^Pl*-) 
> ELoO(i-^)"- fe ^7W 

= B7(V) 

(10.17) 

where $V$ follows a binomial distribution with parameters $n$ and $w$. By
Jensen's inequality, we have $E\gamma(V) \ge \gamma[E(V)] = \gamma(nw)$, which ends
the proof. □
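
The final Jensen step can be illustrated numerically with the convex function $\gamma(x) = 1 - \sqrt{1 - (1-d)^x}$, which is the one that appears in (10.18). In the sketch below the parameter values of $n$, $w$ and $d$ are arbitrary illustrative choices and the function names are hypothetical; the binomial expectation is computed exactly from the probability mass function.

```python
import math

def gamma_fn(x, d):
    """gamma(x) = 1 - sqrt(1 - (1 - d)**x); convex in x >= 0 for 0 < d < 1."""
    return 1.0 - math.sqrt(1.0 - (1.0 - d) ** x)

def binomial_mean_of_gamma(n, w, d):
    """E gamma(V) for V ~ Binomial(n, w), summed over the exact pmf."""
    return sum(
        math.comb(n, k) * w ** k * (1.0 - w) ** (n - k) * gamma_fn(k, d)
        for k in range(n + 1)
    )

jensen_holds = True
for n in (1, 5, 20, 50):
    for w in (0.05, 0.3, 0.9):
        for d in (0.01, 0.2, 0.7):
            lhs = binomial_mean_of_gamma(n, w, d)
            rhs = gamma_fn(n * w, d)
            jensen_holds = jensen_holds and (lhs >= rhs - 1e-12)

print("E gamma(V) >= gamma(n w) on all tested cases:", jensen_holds)
```

Note that $\gamma(0) = 1$, which matches the convention $S_\wedge(P_{\mathcal{X}_1,+}^{\otimes 0}, P_{\mathcal{X}_1,-}^{\otimes 0}) = 1$ used in the lemma.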
The interest of the previous lemma is to provide a lower bound on
the similarity between representatives of the hypercube from a lower
bound on the similarity between restricted representatives, restricted
representatives being much simpler to study. The following result lower
bounds the $\wedge$-similarity between the restricted representatives of the
hypercube.

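Lemma 10.6. For any non-negative integer $k$, we have

$$S_\wedge(P_{\mathcal{X}_1,+}^{\otimes k}, P_{\mathcal{X}_1,-}^{\otimes k}) \ge 1 - \sqrt{1 - [1 - d_{\mathrm{II}}]^k} \ge 1 - \sqrt{k d_{\mathrm{II}}}. \qquad (10.18)$$

When the hypercube is symmetrical and constant, for $N$ a centered
gaussian random variable with variance 1, we have

$$S_\wedge(P_{\mathcal{X}_1,+}^{\otimes k}, P_{\mathcal{X}_1,-}^{\otimes k}) \ge P\Big( |N| > \sqrt{\frac{k d_{\mathrm{II}}}{1 - d_{\mathrm{II}}}} \Big) - d_{\mathrm{II}}^{1/4}. \qquad (10.19)$$

Proof. First, we recall that $P_0$ denotes the base of the hypercube (see
Definition 8.2). The conditional law of $(X, Y)$ knowing $X \in \mathcal{X}_1$, when
$(X, Y)$ is drawn from $P_0$, is denoted $P_{\mathcal{X}_1, 0}$.

For any $r \in \{-, 0, +\}$, introduce $P_{r,x}$ the probability distribution
on the output space such that $P_{r,x}(dY) = P_{\mathcal{X}_1, r}(dY | X = x)$. We have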



®fe 



X±,0 #1,0 



E JCf~pj?* E n fc ~-Pf 1 fc ol^i fc 



(10.20) 

where $\otimes_{i=1}^k P_{r, x_i}$, $r \in \{-1; 1\}$, denotes the law of the $k$-tuple $(Y_1, \dots, Y_k)$
when the $Y_i$ are independently drawn from $P_{r, x_i}$.

To study divergences (or equivalently similarities) between $k$-fold
product distributions, the standard way is to link the divergence (or
similarity) of the product with the ones of base distributions. This
leads to tensorization equalities or inequalities. To obtain a
tensorization inequality for $S_\wedge$, we introduce the similarity associated with the
square root function (which is non-negative and concave):

$$S_{\sqrt{\cdot}}(P, Q) \triangleq \int \sqrt{dP\, dQ}$$

and use the following lemmas:

Lemma 10.7. For any probability distributions $P$ and $Q$, we have

$$S_\wedge(P, Q) \ge 1 - \sqrt{1 - S_{\sqrt{\cdot}}^2(P, Q)}.$$

Proof. Introduce the variational distance $V(P, Q)$ as the $f$-divergence
associated with the convex function $f : u \mapsto \frac{1}{2}|u - 1|$. From Scheffé's
theorem, we have $S_\wedge(P, Q) = 1 - V(P, Q)$ for any distributions $P$ and
$Q$. Introduce the Hellinger distance $H$, which is defined as $H(P, Q) \ge 0$
and $1 - \frac{H^2(P, Q)}{2} = S_{\sqrt{\cdot}}(P, Q)$ for any probability distributions $P$ and
$Q$. The variational and Hellinger distances are known (see e.g. [51,
Lemma 2.2]) to be related by

$$V(P, Q) \le \sqrt{1 - \Big( 1 - \frac{H^2(P, Q)}{2} \Big)^2},$$

hence the result. □

Lemma 10.8. For any distributions $P^{(1)}, \dots, P^{(k)}, Q^{(1)}, \dots, Q^{(k)}$, we
have

$$S_{\sqrt{\cdot}}(P^{(1)} \otimes \cdots \otimes P^{(k)}, Q^{(1)} \otimes \cdots \otimes Q^{(k)}) = S_{\sqrt{\cdot}}(P^{(1)}, Q^{(1)}) \times \cdots \times S_{\sqrt{\cdot}}(P^{(k)}, Q^{(k)}).$$

Proof. When it exists, the density of $P^{(1)} \otimes \cdots \otimes P^{(k)}$ w.r.t. $Q^{(1)} \otimes \cdots \otimes Q^{(k)}$
is the product of the densities of $P^{(i)}$ w.r.t. $Q^{(i)}$, $i = 1, \dots, k$,
hence the desired tensorization equality. □
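
For distributions on finite sets, both lemmas reduce to statements about sums and can be checked directly. The sketch below is an illustration under our own assumptions: the distributions are random probability vectors, $S_\wedge$ is computed as $\sum_i \min(p_i, q_i)$, the square-root similarity as $\sum_i \sqrt{p_i q_i}$, and the helper names are hypothetical.

```python
import itertools
import math
import random

def min_similarity(p, q):
    """S_wedge(P, Q) = sum_i min(p_i, q_i) for finite distributions."""
    return sum(min(a, b) for a, b in zip(p, q))

def sqrt_similarity(p, q):
    """Square-root similarity: sum_i sqrt(p_i * q_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def product_law(p1, p2):
    """Probability vector of the product distribution, in a fixed order."""
    return [a * b for a, b in itertools.product(p1, p2)]

def random_law(size, rng):
    raw = [rng.random() + 1e-6 for _ in range(size)]
    norm = sum(raw)
    return [x / norm for x in raw]

rng = random.Random(1)
max_tensor_gap = 0.0      # Lemma 10.8: should be zero up to rounding
lemma_10_7_holds = True   # Lemma 10.7: lower bound on S_wedge
for _ in range(300):
    p1, q1 = random_law(3, rng), random_law(3, rng)
    p2, q2 = random_law(4, rng), random_law(4, rng)
    p, q = product_law(p1, p2), product_law(q1, q2)
    s = sqrt_similarity(p, q)
    gap = abs(s - sqrt_similarity(p1, q1) * sqrt_similarity(p2, q2))
    max_tensor_gap = max(max_tensor_gap, gap)
    lower = 1.0 - math.sqrt(max(0.0, 1.0 - s * s))
    lemma_10_7_holds = lemma_10_7_holds and (min_similarity(p, q) >= lower - 1e-12)

print("tensorization gap:", max_tensor_gap)
print("Lemma 10.7 bound holds:", lemma_10_7_holds)
```

The square-root similarity tensorizes exactly while $S_\wedge$ does not, which is why the proof passes through it to control $S_\wedge$ for product distributions.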



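From the last two lemmas, we obtain

$$S_\wedge\big( \otimes_{i=1}^k P_{+,x_i}, \otimes_{i=1}^k P_{-,x_i} \big) \ge 1 - \sqrt{1 - \prod_{i=1}^k S_{\sqrt{\cdot}}^2(P_{+,x_i}, P_{-,x_i})}. \qquad (10.21)$$

From (10.20), (10.21) and Jensen's inequality, we obtain

$$
\begin{aligned}
S_\wedge(P_{\mathcal{X}_1,+}^{\otimes k}, P_{\mathcal{X}_1,-}^{\otimes k})
&\ge E_{X_1^k \sim \mu^{\otimes k}(\cdot | \mathcal{X}_1)} \Big[ 1 - \sqrt{1 - \prod_{i=1}^k S_{\sqrt{\cdot}}^2(P_{+,X_i}, P_{-,X_i})} \Big] \\
&\ge 1 - \sqrt{1 - E_{X_1^k \sim \mu^{\otimes k}(\cdot | \mathcal{X}_1)} \prod_{i=1}^k S_{\sqrt{\cdot}}^2(P_{+,X_i}, P_{-,X_i})} \\
&= 1 - \sqrt{1 - \big[ E_{\mu(\cdot | \mathcal{X}_1)} S_{\sqrt{\cdot}}^2(P_{+,X}, P_{-,X}) \big]^k}.
\end{aligned}
$$

Now we have

$$
\begin{aligned}
E_{\mu(\cdot | \mathcal{X}_1)} S_{\sqrt{\cdot}}^2(P_{+,X}, P_{-,X})
&= E_{\mu(\cdot | \mathcal{X}_1)} \big[ \sqrt{p_+ p_-} + \sqrt{(1 - p_+)(1 - p_-)} \big]^2 \\
&= 1 - E_{\mu(\cdot | \mathcal{X}_1)} \big[ \sqrt{p_+ (1 - p_-)} - \sqrt{(1 - p_+) p_-} \big]^2 \\
&= 1 - d_{\mathrm{II}}.
\end{aligned}
$$

So we get

$$S_\wedge(P_{\mathcal{X}_1,+}^{\otimes k}, P_{\mathcal{X}_1,-}^{\otimes k}) \ge 1 - \sqrt{1 - (1 - d_{\mathrm{II}})^k} \ge 1 - \sqrt{k d_{\mathrm{II}}}, \qquad (10.22)$$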
where the second inequality follows from the inequality $1 - x^k \le k(1 - x)$
that holds for any $0 \le x \le 1$ and $k \ge 1$. This ends the proof of (10.18).
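
Both the elementary inequality $1 - x^k \le k(1-x)$ and the resulting chain in (10.22) are easy to check numerically. The sketch below is illustrative only: `d` stands for $d_{\mathrm{II}}$, the grids are arbitrary, and the function names are hypothetical.

```python
import math

def bernoulli_gap(x, k):
    """k * (1 - x) - (1 - x**k); non-negative for 0 <= x <= 1 and integer k >= 1."""
    return k * (1.0 - x) - (1.0 - x ** k)

def sharp_bound(d, k):
    """First lower bound of (10.22): 1 - sqrt(1 - (1 - d)**k)."""
    return 1.0 - math.sqrt(1.0 - (1.0 - d) ** k)

def simple_bound(d, k):
    """Second lower bound of (10.22): 1 - sqrt(k * d)."""
    return 1.0 - math.sqrt(k * d)

smallest_gap = min(
    bernoulli_gap(i / 100.0, k) for i in range(101) for k in range(1, 51)
)
chain_holds = all(
    sharp_bound(d, k) >= simple_bound(d, k) - 1e-12
    for d in (0.001, 0.01, 0.1, 0.5)
    for k in range(1, 51)
)

print("smallest value of k(1-x) - (1-x^k):", smallest_gap)
print("sharp bound dominates simple bound:", chain_holds)
```

The simpler bound $1 - \sqrt{k d_{\mathrm{II}}}$ is only informative when $k d_{\mathrm{II}} < 1$, whereas the sharper one is always non-negative.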

For (10.19), since we assume that the hypercube is symmetrical and
constant, we can tighten (10.22) for $k\sqrt{d_{\mathrm{II}}} > 1$. We have



^f-_{Z\) < l) +i^*_(^±(Zf) > l). 

(10.23) 

Since ^±(z) = g^gggg = for any z = (*,„) € Z, we 

have 

£xt±(yk\ _ yik P + , Xi {Yj) 
p®h ) — lli=i p_ iX .(Yi) 

" y 4 =h 2 (x 4 ) 



5 A (i^: + ,i , | [ *_)=^. 



(10.24) 

Using that the hypercube is symmetrical and constant. (10.24) leads 
to 



\ li- I = h 1 (x i )-ly i = h 2 (^i) 

(10.25) 



-^w^y^i) - iu=i \ i-p+(x. 



.1-P4 



(x i )-J-y i = ''2(x i )J 



Without loss of generality, we may assume that p+ > 1/2. Then we 

havep+ = 1 -p- = Introduce W t = \ Yi=hl{Xl ) ~ l Yi =h 2 (Xi)- 

From (10.23) and (10.25), we obtain 

S A (P^ + ,P^ t _) = Pl k + (T,LiWi<0)+Pl k t _(T,LiWi>0) 

= P| 1 fc -(E-=iW i >o)+pf i fe _(E- =1 w i >o) 

The law of U = J2i=i W% when (Xi,Yi), . . . , (X n ,Y n ) are indepen- 
dently drawn from Px ± ,- is the binomial distribution of parameter 
(k, Let [x\ still denote the largest integer k such that k < x. 

We get 

S A (P^ + ,P^_) = ¥{U > k/2)+P(U >k/2) 

> 2P(U > [k/2\ ) - 2P(U = [k/2\ ) 

When fcydn > 1, this last r.h.s. can be lower bounded by Slud's theo- 
rem [49] for the first term and by using Stirling's formula for the second 
term (see e.g. [33, Appendix A. 8]). It gives 



2[fc/2] -fc(l- N /d^ : ) 
where we recall that $N$ is a normalized gaussian random variable.
Finally we have



1 — \fkd\[ 4 for any k > 1 

^ P(|7V| > yfgfc) ^4( 4 for any k > -fe. 

which can be weakened into for any non-negative integer k 



"ii > 



s A (p® i k t+ ,p® i k ! _)>v(\N\> 

that is (10.19). □ 



By computing the second derivative of $u \mapsto \sqrt{1 - e^{-u}}$ and $u \mapsto \int_0^{\sqrt{u}} e^{-t^2} dt$,
we obtain that these functions are concave. So for any
$a \in [0; 1]$, the functions $x \mapsto 1 - \sqrt{1 - a^x}$, $x \mapsto 1 - \sqrt{xa}$ and
$x \mapsto P\big( |N| > \sqrt{\frac{xa}{1-a}} \big) - a^{1/4}$ are convex. The convexity of these functions
and Lemmas 10.5 and 10.6 imply Lemma 10.3.

10.11.2 Proof of Lemma 10.4 

Let $\theta : u \mapsto u/(u + 1)$ denote the non-negative concave function on
which the similarity $S_\bullet$ is defined. For any $u > 0$, we have

hence for any probability distributions P and Q, 
S.(P,Q) - /<?(5)dQ 

> J/(dP + dQ + VaPTO-^r-^r) 

= | + i / VrfPrfQ - g / §y72 - | / ^172 

The goal of this bound is to obtain a form for which tensorization 
equalities hold. Precisely, let h = J y/dP[ + ] dP[_] and h — j Jpt~k = 



J — where the last equality holds since the hypercube is symmet- 
dP [+] 

rical. We have 

C { p<g>« p(g)n\ x 1 , 1 rn 1 pi 
V [+] ' M-] > - 2 + I 1 ! ~ 4 J 2 

Since the hypercube is symmetrical and constant, without loss of gen- 
erality, we may assume that > i on X\ . Then we have 1 — p_ = 
p+ — (1 + v / rfii")/2, hence Ii =1 —w + — du and 

(i+VrfI7) 3/2 , (i-VrfI7) 3/ ^ _ i _ ,„ , ,„ 



h = 1 ~ W + H ^vg. + 7 V^n =l- U , + l( ; 



2 V(i-V^7) 1/2 1 (i+V^7) 1/2 y v 7 !^ 
which gives the desired result.



10.12 Proof of Theorems 8.11 and 8.12 

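We consider a $(\tilde m, w, d_{\mathrm{II}})$-hypercube (see Definition 8.3 [p. 32]) with

$$\tilde m = \lfloor \log_2 |\mathcal{G}| \rfloor,$$

$h_1 \equiv -B$ and $h_2 \equiv B$, and with $w$ and $d_{\mathrm{II}}$ to be taken in order to
(almost) maximize the bound.

Case $q = 1$: From (8.27), we have $d_{\mathrm{I}} = \frac{\sqrt{d_{\mathrm{II}}}}{2} |h_2 - h_1| = B\sqrt{d_{\mathrm{II}}}$ so
that, choosing $w = 1/\tilde m$, (8.17) gives

$$\sup_{P \in \mathcal{P}} \Big\{ E R(\hat g) - \min_g R(g) \Big\} \ge B \sqrt{d_{\mathrm{II}}} \big( 1 - \sqrt{n d_{\mathrm{II}} / \tilde m} \big).$$

Maximizing the lower bound w.r.t. $d_{\mathrm{II}}$, we choose $d_{\mathrm{II}} = \frac{\tilde m}{4n} \wedge 1$ and
obtain the announced result.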
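
The choice $d_{\mathrm{II}} = \frac{\tilde m}{4n} \wedge 1$ can be confirmed by a direct grid search over $d_{\mathrm{II}} \in (0; 1]$. The sketch below is an illustration with arbitrary values of $n$ and $\tilde m$; `d` stands for $d_{\mathrm{II}}$, `m` for $\tilde m$, and the function names are hypothetical.

```python
import math

def lower_bound(d, n, m, B=1.0):
    """B * sqrt(d) * (1 - sqrt(n * d / m)), the right-hand side for q = 1."""
    return B * math.sqrt(d) * (1.0 - math.sqrt(n * d / m))

def best_d(n, m):
    """Maximizer over (0, 1]: m / (4n), truncated at 1."""
    return min(m / (4.0 * n), 1.0)

checks = []
for n, m in [(100, 5), (1000, 10), (10, 3), (2, 20)]:
    d_star = best_d(n, m)
    grid_max = max(lower_bound(i / 10000.0, n, m) for i in range(1, 10001))
    # The closed-form choice should be at least as good as every grid point.
    checks.append(lower_bound(d_star, n, m) >= grid_max - 1e-12)

print("closed-form maximizer beats the grid:", all(checks))
```

When $\tilde m \le 4n$, plugging the maximizer into the bound gives $\frac{B}{4}\sqrt{\tilde m / n}$, since the factor in parentheses equals $1/2$ at the optimum.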



Case 1 < q < 1 

< e < 1, we have 



A 1 : From (8.11) and (8.29), for any 



di > 

> 



ir Jijf [* a (i - t)]|< li/i2 + VdEt) 



dn e(2-e) 
2 4 



d n x 



inf 



<7 r i-e 2 d„ l §Ef (2-B) 9 

o-l L 4 J , , , 

2«+i[(l+eV3iI 

> i(^) dllX ^ ( i_ e v^)^(i + e v^) 

= (i_e/2)gS«|^(l-eV5i)^(l + eV3n) 



5TT- 

)/2]9-i 

l-2g 
9-1 
l-2g 
9-1 



Let X = (1 - e^/du)^ : (1 + eVdii)^ • From (8.17), taking w = 
we get 



sup {ER{g) - mm R(g)\ > (1 - e/2)KqB*f4sL(l - y/ndu/m). 
Pen s 1 

(10.26) 



This leads us to choose du = ^ A 1 and e = (g — l)y V | < | and 
obtain 

ER(g) - mm a R(g) > ^1 K {{\^) V (l - )}. 

Since 1 < g < 2 and eVdn = ^g-, we may check that if > 0.29 (to be 
compared with lim^i A' = er 1 w 0.37). 
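Case $q > 1 + \sqrt{\frac{\tilde m}{4n} \wedge 1}$: We take $w = \frac{1}{n+1} \wedge \frac{1}{\tilde m}$. From (8.4), (8.6) and
(8.28), we get $d_{\mathrm{I}} = \psi_{1,0,-B,B}(1/2) = \phi_{-B,B}(1/2) = B^q$. From (8.19),
we obtain

$$
\begin{aligned}
\sup_{P \in \mathcal{P}} \Big\{ E R(\hat g) - \min_{g \in \mathcal{G}} R(g) \Big\}
&\ge \Big( \frac{\tilde m}{n+1} \wedge 1 \Big) B^q \Big( 1 - \frac{1}{n+1} \wedge \frac{1}{\tilde m} \Big)^n \\
&\ge e^{-1} B^q \Big( \frac{\tilde m}{n+1} \wedge 1 \Big), \qquad (10.27)
\end{aligned}
$$

where the last inequality uses $[1 - 1/(n+1)]^n \searrow e^{-1}$.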
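
The monotone convergence $[1 - 1/(n+1)]^n \searrow e^{-1}$ used in the last step is easy to check numerically. The sketch below is a plain illustration (the function name and the range of $n$ are arbitrary choices): it verifies that the sequence is strictly decreasing and stays above $e^{-1}$.

```python
import math

def survival(n):
    """(1 - 1/(n+1))**n, the factor (1 - w)^n for w = 1/(n+1)."""
    return (1.0 - 1.0 / (n + 1)) ** n

values = [survival(n) for n in range(1, 2001)]
is_decreasing = all(a > b for a, b in zip(values, values[1:]))
stays_above_limit = all(v > math.exp(-1.0) for v in values)

print("decreasing:", is_decreasing)
print("above 1/e:", stays_above_limit)
print("value at n = 2000:", values[-1], "limit:", math.exp(-1.0))
```

This follows from $(1 - 1/(n+1))^n = 1/(1 + 1/n)^n$ and the fact that $(1 + 1/n)^n$ increases to $e$.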



Improvement when l + y^Al<g<2: From (10.26), 
by choosing e = 1/2 and introducing K' = (1 — v / 3n/2)^ rT (l + 

l-2q 

vdn/2) , we obtain 

sup{Ei?(g)-mini?(, 9 )} > ^'^(1-7^). 
Pew f 

This leads us to choose dn = ^ A 1. Since J^iu < g — 1, we have 

v^n < |(3 - 1), hence > (l - |(« - l))^(l + §(g - 1))^. 
For any 1 < q < 2, this last quantity is greater than 0.2. So we have 



proved that for 1 + J ^ A 1 < q < 2, 

ER(g) mm i?(.g) > ^B^^Ml. (1Q 2g) 

Theorem 8.12 follows from (10.27) and (10.28). 
10.13 Proof of Theorem 8.14 

10.13.1 Proof of the first inequality of Theorem 8.14. 

Let $\tilde m = \lfloor \log_2 |\mathcal{G}| \rfloor$. Contrary to other lower bounds obtained in this
work, this learning setting requires asymmetrical hypercubes of
distributions. Here we consider a constant $\tilde m$-dimensional hypercube of
distributions with edge probability $w$ such that $p_+ \equiv p$, $p_- \equiv 0$, $h_1 \equiv +B$
and $h_2 \equiv 0$, where $w$, $p$ and $B$ are positive real parameters to be
chosen according to the strategy described at the beginning of Section
8.4. To have $E|Y|^s \le A$, we need that $\tilde m w p B^s \le A$. To ensure that
a best prediction function has infinite norm bounded by $b$, from the
computations at the beginning of Appendix A, we need that

$$B \le \frac{p^{1/(q-1)} + (1-p)^{1/(q-1)}}{p^{1/(q-1)}}\, b.$$

This inequality is in particular satisfied for $B = C p^{-1/(q-1)}$ for an
appropriate small constant $C$ depending on $b$ and $q$. From the definition of
the edge discrepancy of type II, we have $d_{\mathrm{II}} = p$. In order to have the
r.h.s. of (8.17) of order $\tilde m w d_{\mathrm{I}}$, we want to have $nwp \le C < 1$. All the
previous constraints lead us to take the parameters $w$, $p$ and $B$ such
that

$$
\begin{cases}
B = C p^{-1/(q-1)} \\
\tilde m w p B^s = A \\
n w p = 1/4
\end{cases}
$$

Let $Q = \frac{\tilde m}{n} \wedge 1$. This leads to $p = C Q^{(q-1)/s}$, $B = C Q^{-1/s}$ and
$w = C \tilde m^{-1} Q^{1 - (q-1)/s}$ with $C$ small positive constants depending on $b$,
$A$, $q$ and $s$. Now from the definition of the edge discrepancy of type I
and (8.5), we have

di = ^J^[tA(l-t)]\^ tB (tp)\dt 

^ fl/4 3 min [p/4;3p/4] 1 0o,B | dt 

> Cp 2 p^Bi 
= C 

where the last inequality comes from (8.29). From (8.17), we get

$$\sup_{P \in \mathcal{P}} \Big\{ E R(\hat g) - \min_{g \in \mathcal{G}} R(g) \Big\} \ge C Q^{1 - \frac{q-1}{s}}.$$



10.13.2 Proof of the second inequality of Theorem 8.14. 

We still use $\tilde m = \lfloor \log_2 |\mathcal{G}| \rfloor$. We consider a $(\tilde m, w, d_{\mathrm{II}})$-hypercube with
$h_1 \equiv -B$ and $h_2 \equiv +B$, where $w$, $d_{\mathrm{II}}$ and $B$ are positive real parameters
to be chosen according to the strategy described at the beginning of
Section 8.4. To have $E|Y|^s \le A$, we need that $\tilde m w B^s \le A$. To ensure
that a best prediction function has infinite norm bounded by $b$, from
the computations at the beginning of Appendix A, we need that

$$B \le \frac{[1 + (d_{\mathrm{II}})^{1/2}]^{1/(q-1)} + [1 - (d_{\mathrm{II}})^{1/2}]^{1/(q-1)}}{[1 + (d_{\mathrm{II}})^{1/2}]^{1/(q-1)} - [1 - (d_{\mathrm{II}})^{1/2}]^{1/(q-1)}}\, b. \qquad (10.29)$$

For fixed $q$ and $b$, this inequality essentially means that $B \le C d_{\mathrm{II}}^{-1/2}$
since we intend to take $d_{\mathrm{II}}$ close to 0. In order to have the r.h.s. of
(8.17) of order $\tilde m w d_{\mathrm{I}}$, we want to have $n w d_{\mathrm{II}} \le 1/4$ where, once more,
this last constant is arbitrarily taken. The previous constraints lead
us to choose

$$
\begin{cases}
B = C d_{\mathrm{II}}^{-1/2} \\
\tilde m w B^s = A \\
n w d_{\mathrm{II}} = 1/4
\end{cases}
$$

We still use $Q = \frac{\tilde m}{n} \wedge 1$. This leads to $d_{\mathrm{II}} = C Q^{2/(s+2)}$, $B = C Q^{-1/(s+2)}$
and $w = C \tilde m^{-1} Q^{s/(s+2)}$ with $C$ small positive constants depending on
$b$, $A$, $q$ and $s$. Now from (8.29), the differentiability assumption is
satisfied for $\zeta = C B^q = C Q^{-q/(s+2)}$. From (8.17) and (8.21), we
obtain

$$\sup_{P \in \mathcal{P}} \Big\{ E R(\hat g) - \min_{g \in \mathcal{G}} R(g) \Big\} \ge C Q^{1 - \frac{q}{s+2}}.$$
10.14 Proof of Theorem 8.16

The starting point is similar to the one in Section 10.13.2. Since $q = 2$,
(10.29) simplifies into $B \le b (d_{II})^{-1/2}$. We take
$B = b (d_{II})^{-1/2}$ and $w = A/(\tilde m B^2)$ and we optimize the
parameter $d_{II}$ in order to maximize the lower bound. From (8.30), we get
$\tilde m w d_I = A d_{II}$. Introducing
$a = n w d_{II} = \frac{nA}{\tilde m b^2} (d_{II})^2$, we obtain
$\tilde m w d_I = b \sqrt{A \tilde m / n}\, \sqrt{a}$. The results then follow
from Corollary 8.8 and the fact that the differentiability assumption (8.9)
holds for $\zeta = 8B^2 = 8b^2/d_{II}$.
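
For readers who want the intermediate algebra, the change of variable from $d_{II}$ to $a$ can be written out as follows. This is only a worked restatement of the identities quoted in the paragraph above (with $B = b (d_{II})^{-1/2}$ and $w = A/(\tilde m B^2)$); no new assumption is introduced.

```latex
\begin{align*}
a = n w d_{II} = \frac{nA}{\tilde m B^2}\, d_{II}
  = \frac{nA}{\tilde m b^2}\, (d_{II})^2
  \quad &\Longrightarrow \quad
  d_{II} = b \sqrt{\frac{\tilde m\, a}{nA}}, \\
\tilde m w d_I = A\, d_{II}
  &= b \sqrt{\frac{A \tilde m}{n}}\, \sqrt{a}.
\end{align*}
```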

A Computations of the second derivative
of $\phi$ for the $L_q$-loss

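Let $y_1$ and $y_2$ be fixed. We start with the computation of
$\phi_{y_1,y_2}$. For any $p \in [0;1]$, the quantity
$\psi_{p,y_1,y_2}(y) = p|y-y_1|^q + (1-p)|y-y_2|^q$ is minimized when
$y \in [y_1 \wedge y_2; y_1 \vee y_2]$ and
$pq|y-y_1|^{q-1} = (1-p)q|y_2-y|^{q-1}$. Introducing $r = \frac{1}{q-1}$ and
$D = p^r + (1-p)^r$, the minimizer can be written as
$y = \frac{p^r y_1 + (1-p)^r y_2}{D}$ and the minimum is

$$\phi_{y_1,y_2}(p) = \big(p(1-p)^{rq} + (1-p)p^{rq}\big)\frac{|y_2-y_1|^q}{D^q} = p(1-p)\frac{|y_2-y_1|^q}{D^{q-1}},$$

where we use the equality $rq = 1 + r$. We get

\begin{align*}
\frac{\phi'_{y_1,y_2}(p)}{|y_2-y_1|^q}
 &= (1-2p)D^{1-q} + p(1-p)(1-q)\,rD^{-q}\big[p^{r-1} - (1-p)^{r-1}\big] \\
 &= D^{-q}\big\{(1-2p)[p^r + (1-p)^r] - (1-p)p^r + p(1-p)^r\big\} \\
 &= D^{-q}\big\{(1-p)^{r+1} - p^{r+1}\big\},
\end{align*}

\begin{align*}
\frac{\phi''_{y_1,y_2}(p)}{|y_2-y_1|^q}
 &= -qrD^{-q-1}\big[p^{r-1} - (1-p)^{r-1}\big]\big[(1-p)^{r+1} - p^{r+1}\big]
    - qrD^{-q-1}\big[p^r + (1-p)^r\big]^2 \\
 &= -\frac{q}{q-1}\,
    \frac{[p(1-p)]^{\frac{2-q}{q-1}}}{\big[p^{\frac{1}{q-1}} + (1-p)^{\frac{1}{q-1}}\big]^{q+1}},
\end{align*}

hence

$$\phi''_{y_1,y_2}(p) = -\frac{q}{q-1}\,
  \frac{[p(1-p)]^{\frac{2-q}{q-1}}}{\big[p^{\frac{1}{q-1}} + (1-p)^{\frac{1}{q-1}}\big]^{q+1}}\,
  |y_2 - y_1|^q.$$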
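
The closed forms of this appendix can be checked numerically. The sketch below is only a sanity check under illustrative values of $p$, $y_1$, $y_2$ and $q$ (all chosen for the example): it evaluates the minimizer, the minimum $\phi_{y_1,y_2}(p) = p(1-p)|y_2-y_1|^q / D^{q-1}$ and the second derivative $-qr[p(1-p)]^{r-1}|y_2-y_1|^q / D^{q+1}$, and compares the latter with a central finite difference. The function names are hypothetical.

```python
def minimizer(p, y1, y2, q):
    """Minimizer of y -> p|y - y1|^q + (1 - p)|y - y2|^q for q > 1."""
    r = 1.0 / (q - 1.0)
    D = p ** r + (1.0 - p) ** r
    return (p ** r * y1 + (1.0 - p) ** r * y2) / D


def phi(p, y1, y2, q):
    """Closed form of the minimum: p(1-p)|y2 - y1|^q / D^(q-1)."""
    r = 1.0 / (q - 1.0)
    D = p ** r + (1.0 - p) ** r
    return p * (1.0 - p) * abs(y2 - y1) ** q / D ** (q - 1.0)


def phi_second(p, y1, y2, q):
    """Closed form of the second derivative of phi with respect to p."""
    r = 1.0 / (q - 1.0)
    D = p ** r + (1.0 - p) ** r
    return -q * r * (p * (1.0 - p)) ** (r - 1.0) * abs(y2 - y1) ** q / D ** (q + 1.0)


# Illustrative values (assumptions for this example only).
p, y1, y2, q = 0.3, -1.0, 2.0, 3.0
h = 1e-4
finite_diff = (phi(p + h, y1, y2, q) - 2 * phi(p, y1, y2, q) + phi(p - h, y1, y2, q)) / h ** 2
print(minimizer(p, y1, y2, q), phi(p, y1, y2, q), phi_second(p, y1, y2, q), finite_diff)
```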
B Expected risk bound from Hoeffding's
inequality

Let $\lambda' > 0$ and $\rho$ be a probability distribution on $\mathcal{G}$.
Let $r(g)$ denote the empirical risk of a prediction function $g$, that is
$r(g) = \frac{1}{n}\sum_{i=1}^n L(Z_i, g)$. Hoeffding's inequality applied to
the random variable
$W = \mathbb{E}_{g\sim\rho} L(Z, g) - L(Z, g') \in [-(b-a); b-a]$ for a fixed
$g'$ gives

$$\mathbb{E}_{Z\sim P}\, e^{\eta[W - \mathbb{E}W]} \le e^{\eta^2 (b-a)^2/2}$$

for any $\eta > 0$. For $\eta = \lambda'/n$, this leads to

$$\mathbb{E}_{Z_1^n}\, e^{\lambda'[R(g') - \mathbb{E}_{g\sim\rho} R(g) - r(g') + \mathbb{E}_{g\sim\rho}\, r(g)]} \le e^{(\lambda')^2 (b-a)^2/(2n)}.$$

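Consider the Gibbs distribution $\hat\rho = \pi_{-\lambda' r}$. This
distribution satisfies

$$\mathbb{E}_{g'\sim\hat\rho}\, r(g') + K(\hat\rho, \pi)/\lambda' \le \mathbb{E}_{g\sim\rho}\, r(g) + K(\rho, \pi)/\lambda'.$$

We have

\begin{align*}
\mathbb{E}_{Z_1^n} \mathbb{E}_{g'\sim\hat\rho} R(g') - \mathbb{E}_{g\sim\rho} R(g)
 &\le \mathbb{E}_{Z_1^n} \Big\{ \mathbb{E}_{g'\sim\hat\rho} \big[ R(g') - \mathbb{E}_{g\sim\rho} R(g) - r(g') + \mathbb{E}_{g\sim\rho}\, r(g) \big] + \frac{K(\rho,\pi) - K(\hat\rho,\pi)}{\lambda'} \Big\} \\
 &\le \frac{K(\rho,\pi)}{\lambda'} + \frac{1}{\lambda'} \log \mathbb{E}_{g'\sim\pi} \mathbb{E}_{Z_1^n}\, e^{\lambda'[R(g') - \mathbb{E}_{g\sim\rho} R(g) - r(g') + \mathbb{E}_{g\sim\rho}\, r(g)]} \\
 &\le \frac{K(\rho,\pi)}{\lambda'} + \frac{\lambda'(b-a)^2}{2n}.
\end{align*}

This proves that for any $\lambda > 0$, the generalization error of the
algorithm which draws its prediction function according to the Gibbs
distribution $\pi_{-\lambda \Sigma_n/2}$ satisfies

$$\mathbb{E}_{Z_1^n} \mathbb{E}_{g'\sim\pi_{-\lambda\Sigma_n/2}} R(g') \le \min_{\rho} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + \frac{\lambda (b-a)^2}{4} + \frac{2K(\rho,\pi)}{\lambda n} \Big\},$$

where we use the change of variable $\lambda = 2\lambda'/n$ in order to
underline the difference with (6.4).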
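
The variational property of the Gibbs distribution used in this appendix is easy to illustrate on a finite set of functions. The sketch below is a minimal example under stated assumptions: the set has three functions, the prior is uniform, and the empirical risks and the value of $\lambda'$ are made up for illustration. All function names are hypothetical. It computes $\pi_{-\lambda' r}$ and the quantity $\mathbb{E}_{g\sim\rho}\, r(g) + K(\rho,\pi)/\lambda'$ that the Gibbs distribution minimizes.

```python
import math


def gibbs(prior, emp_risk, lam):
    """Distribution with weights proportional to prior[g] * exp(-lam * emp_risk[g])."""
    shift = min(emp_risk)  # subtracting the minimum avoids numerical underflow
    weights = [p * math.exp(-lam * (r - shift)) for p, r in zip(prior, emp_risk)]
    total = sum(weights)
    return [w / total for w in weights]


def kl(rho, prior):
    """Kullback-Leibler divergence K(rho, prior) on a finite set."""
    return sum(x * math.log(x / p) for x, p in zip(rho, prior) if x > 0)


def objective(rho, prior, emp_risk, lam):
    """E_{g~rho} r(g) + K(rho, prior) / lam."""
    return sum(x * r for x, r in zip(rho, emp_risk)) + kl(rho, prior) / lam


# Illustrative inputs (assumptions for this example only).
prior = [1 / 3, 1 / 3, 1 / 3]
emp_risk = [0.2, 0.5, 0.4]
lam = 5.0
rho_hat = gibbs(prior, emp_risk, lam)
print(rho_hat, objective(rho_hat, prior, emp_risk, lam))
```

Any other distribution `rho` passed to `objective` should give a value at least as large as the one obtained for `rho_hat`.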



C Proof of Inequality (10.16) 



To prove (10.16), we need to uniformly control the difference between
the tail of the sum of i.i.d. random variables and its Gaussian
approximation. This is done by the following result.

Theorem C.1 (Berry[15]-Esseen[36] inequality). Let $N$ be a centered
Gaussian variable of variance 1. Let $U_1, \dots, U_n$ be real-valued
independent identically distributed random variables such that
$\mathbb{E}U_1 = 0$, $\mathbb{E}U_1^2 = 1$ and $\mathbb{E}|U_1|^3 < +\infty$.
Then

$$\sup_{x \in \mathbb{R}} \Big| \mathbb{P}\Big(n^{-1/2} \sum_{i=1}^n U_i > x\Big) - \mathbb{P}(N > x) \Big| \le C n^{-1/2}\, \mathbb{E}|U_1|^3 \tag{C.1}$$

for some universal positive constant $C$.
To shorten the notation, let $P^- = P^{\otimes n}_{-1,1,\dots,1}$ and
$P^+ = P^{\otimes n}_{1,1,\dots,1}$ be the $n$-fold product of the
representatives of the hypercube. Since we have $p_+ > 1/2 > p_-$ (by
definition of a $(\tilde m, w, d_{II})$-hypercube (p. 32)), the set of
sequences $Z_1^n$ for which $\frac{P^+}{P^-}(Z_1^n) < 1$ is

$$E = \Big\{ \sum_{i=1}^n 1_{Y_i = h_1(X_i),\, X_i \in \mathcal{X}_1} < \sum_{i=1}^n 1_{Y_i = h_2(X_i),\, X_i \in \mathcal{X}_1} \Big\}.$$

Introduce the quantity

$$S_n = \sum_{i=1}^n \big(2 \cdot 1_{Y_i = h_1(X_i)} - 1\big) 1_{X_i \in \mathcal{X}_1}.$$

We have

\begin{align*}
S_\wedge(P^+, P^-) &= 1 - P^-(E) + P^+(E) \\
 &= 1 - P^-(S_n < 0) + P^+(S_n < 0) \tag{C.2} \\
 &= P^-(S_n \ge 0) + P^-(S_n > 0).
\end{align*}

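Introduce

$$W_i = \big(2 \cdot 1_{Y_i = h_1(X_i)} - 1\big) 1_{X_i \in \mathcal{X}_1}.$$

From now on, we consider that the pairs $Z_i = (X_i, Y_i)$ are generated
by $P^-$, so that $\mathbb{E}W_i$ and $\operatorname{Var} W_i$ simply denote
the expectation and variance of $W_i$ when $(X_i, Y_i)$ is drawn according
to $P_{-1,1,\dots,1}$. Define the normalized quantity

$$U_i = (W_i - \mathbb{E}W_i)/\sqrt{\operatorname{Var} W_i}.$$

We have

$$P^-(S_n > 0) = P^-\Big(n^{-1/2} \sum_{i=1}^n U_i > t_n\Big)$$

where $t_n = -\sqrt{n}\, \mathbb{E}W_1 / \sqrt{\operatorname{Var} W_1}$. By
Berry-Esseen's inequality (Theorem C.1), we get

$$\big| P^-(S_n > 0) - \mathbb{P}(N > t_n) \big| \le C n^{-1/2}\, \mathbb{E}|U_1|^3. \tag{C.3}$$

Let us now upper bound $n^{-1/2}\, \mathbb{E}|U_1|^3$.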

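Since we have $p_+ = (1 + \xi)/2 = 1 - p_-$ for a
$(\tilde m, w, d_{II})$-hypercube (p. 32), the law of $W_1$ is described by

$$\mathbb{P}(W_1 = 1) = \frac{w(1-\xi)}{2}, \qquad \mathbb{P}(W_1 = 0) = 1 - w, \qquad \mathbb{P}(W_1 = -1) = \frac{w(1+\xi)}{2},$$

where $w$ still denotes $\mu(\mathcal{X}_1)$. We get
$\mathbb{E}W_1 = -w\xi$, $\operatorname{Var} W_1 = w(1 - w\xi^2)$ and since
$0 < w \le 1$ and $0 < \xi \le 1$

\begin{align*}
\mathbb{E}|W_1 - \mathbb{E}W_1|^3
 &= (1-w)(w\xi)^3 + \frac{w(1-\xi)}{2}(1 + w\xi)^3 + \frac{w(1+\xi)}{2}(1 - w\xi)^3 \\
 &\le w + w[1 + 3(w\xi)^2] \\
 &\le 5w. \tag{C.4}
\end{align*}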
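
The moments of $W_1$ can be checked by direct enumeration of its three-point law. The sketch below is a small sanity check with illustrative values of $w$ and $\xi$ (assumptions for this example, as is the function name); it recomputes the mean, the variance and the third absolute central moment from the probabilities $w(1-\xi)/2$, $1-w$ and $w(1+\xi)/2$.

```python
def w1_moments(w, xi):
    """Mean, variance and third absolute central moment of W_1 under P^-."""
    law = {1: w * (1 - xi) / 2, 0: 1 - w, -1: w * (1 + xi) / 2}
    mean = sum(v * p for v, p in law.items())
    var = sum((v - mean) ** 2 * p for v, p in law.items())
    third = sum(abs(v - mean) ** 3 * p for v, p in law.items())
    return mean, var, third


# Illustrative values (assumptions for this example only).
w, xi = 0.3, 0.2
mean, var, third = w1_moments(w, xi)
print(mean, -w * xi)                 # both equal E W_1
print(var, w * (1 - w * xi ** 2))    # both equal Var W_1
print(third, 5 * w)                  # third moment and its upper bound 5w
```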
We obtain

$$n^{-1/2}\, \mathbb{E}|U_1|^3 \le n^{-1/2} \frac{5w}{[w(1-\xi^2)]^{3/2}} \le \frac{5}{(1-\xi^2)^{3/2} \sqrt{nw}} \xrightarrow[nw \to +\infty,\ \xi \to 0]{} 0.$$

From (C.3), we get that $|P^-(S_n > 0) - \mathbb{P}(N > t_n)|$ converges to
zero when $nw$ goes to infinity and $\xi$ goes to zero. Now the previous
convergence also holds when '$>$' is replaced with '$\ge$' (since it
suffices to consider the random variables $-U_i$ in Theorem C.1). Using
both convergences and (C.2), we obtain

$$\big| S_\wedge(P^+, P^-) - \mathbb{P}(|N| > t_n) \big| \to 0 \quad \text{as } nw \to +\infty \text{ and } d_{II} \to 0,$$

which is the desired result since
$t_n = -\sqrt{n}\, \mathbb{E}W_1 / \sqrt{\operatorname{Var} W_1} = \sqrt{\frac{n w d_{II}}{1 - w d_{II}}}$.
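
The Gaussian approximation of $S_\wedge(P^+, P^-)$ can be compared with an exact computation for moderate $n$. The sketch below is an illustration under stated assumptions: it takes $d_{II} = \xi^2$ in the expression of $t_n$, uses made-up values of $n$, $w$ and $\xi$, and computes $P^-(S_n \ge 0) + P^-(S_n > 0)$ exactly by convolving the three-point law of $W_1$. The function names are hypothetical.

```python
import math


def similarity_exact(n, w, xi):
    """Return P^-(S_n >= 0) + P^-(S_n > 0), computed exactly by convolution."""
    step = {1: w * (1 - xi) / 2, 0: 1 - w, -1: w * (1 + xi) / 2}
    dist = [0.0] * (2 * n + 1)  # dist[k] is the probability that the partial sum equals k - n
    dist[n] = 1.0
    for _ in range(n):
        new = [0.0] * (2 * n + 1)
        for k, pk in enumerate(dist):
            if pk > 0.0:
                for v, pv in step.items():
                    new[k + v] += pk * pv
        dist = new
    return sum(dist[n:]) + sum(dist[n + 1:])


def gaussian_two_sided_tail(n, w, xi):
    """Return P(|N| > t_n) with t_n = sqrt(n w xi^2 / (1 - w xi^2))."""
    t_n = math.sqrt(n * w * xi ** 2 / (1 - w * xi ** 2))
    return math.erfc(t_n / math.sqrt(2))


# Illustrative values (assumptions for this example only).
n, w, xi = 400, 0.5, 0.05
print(similarity_exact(n, w, xi), gaussian_two_sided_tail(n, w, xi))
```

The two printed values should get closer as $nw$ grows and $\xi$ shrinks, in line with the convergence stated above.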



D Towards adaptivity for the temperature 
parameter 

Once the distribution $\pi$ is fixed, an appropriate choice for the
parameter $\lambda$ of the SeqRand algorithm is the minimizer of the r.h.s.
of (3.2). This minimizer is unknown to the statistician. This section
proposes to modify $\lambda$ during the iterations so that it automatically
adapts to a value close to this minimizer. The adaptive SeqRand algorithm is
described in Figure 7. The idea of incremental updating of the temperature
parameter to solve the adaptivity problem has been successfully
developed in [10, Section 2] and [30, Lemma 3]. Here we improve the
argument by using Lemma D.2.

The following theorem upper bounds the generalization error of the 
adaptive SeqRand algorithm. 

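Theorem D.1. Let $\Delta_\lambda(g, g') = \mathbb{E}_{Z\sim P}\, \delta_\lambda(Z, g, g')$
for $g \in \mathcal{G}$ and $g' \in \bar{\mathcal{G}}$, where we recall that
$\delta_\lambda$ is a function satisfying the variance inequality (see p. 7).
The expected risk of the adaptive SeqRand algorithm satisfies

$$\mathbb{E}_{Z_1^n} \mathbb{E}_{g'\sim\mu_a} R(g') \le \min_{\rho} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + \mathbb{E}_{Z_1^n} \frac{K(\rho,\pi)}{\lambda_{n+1}(n+1)} + \mathbb{E}_{g\sim\rho} \mathbb{E}_{Z_1^n} \mathbb{E}_{\hat g_0^n} \frac{\sum_{i=0}^{n} \Delta_{\lambda_{i+1}}(g, \hat g_i)}{n+1} \Big\}. \tag{D.2}$$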
Input:

• $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{n+1} > 0$ with $\lambda_{i+1}$ possibly depending on the
values of $Z_1, \dots, Z_i$

• $\pi$ a distribution on the set $\mathcal{G}$

1. Define $\hat\rho_0 = \hat\pi(\pi)$ in the sense of the variance inequality (p. 7) and draw
a function $\hat g_0$ according to this distribution. Let $S_0(g) = 0$ for any
$g \in \mathcal{G}$.

2. For any $i \in \{1, \dots, n\}$, iteratively define

$$S_i(g) = S_{i-1}(g) + L(Z_i, g) + \delta_{\lambda_i}(Z_i, g, \hat g_{i-1}) \quad \text{for any } g \in \mathcal{G} \tag{D.1}$$

and

$\hat\rho_i = \hat\pi(\pi_{-\lambda_{i+1} S_i})$ in the sense of the variance inequality (p. 7)

and draw a function $\hat g_i$ according to the distribution $\hat\rho_i$.

3. Predict with a function drawn according to the uniform distribution
on the finite set $\{\hat g_0, \dots, \hat g_n\}$.

Conditionally to the training set, the distribution of the output
prediction function will be denoted $\mu_a$.



Figure 7: The adaptive SeqRand algorithm
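
The steps of Figure 7 can be sketched in code for a finite set of functions. This is a minimal sketch under several assumptions that are not taken from the text: the functions are indexed by integers, the map $\hat\pi$ is taken to be the identity (so that $\hat\rho_i$ is the Gibbs distribution $\pi_{-\lambda_{i+1} S_i}$ itself), the sequence of $\lambda_i$ is fixed in advance rather than data-dependent, and the `delta` callable is supplied by the user. The quadratic `delta` in the usage lines is purely illustrative; whether a given choice satisfies the variance inequality depends on the loss and on $\lambda$. All names are hypothetical.

```python
import math
import random


def adaptive_seqrand(losses, prior, lambdas, delta, rng):
    """Sketch of the adaptive SeqRand iteration on a finite set of functions.

    losses[i - 1][g] is L(Z_i, g); lambdas[i - 1] is lambda_i (length n + 1);
    delta(lam, loss_row, g, g_prev) stands for delta_lambda(Z, g, g_prev).
    Assumption: pi_hat is the identity map.
    """
    n, num = len(losses), len(prior)
    S = [0.0] * num  # S_0(g) = 0
    # Step 1: rho_hat_0 = pi_hat(pi), which is pi under the identity assumption.
    g_prev = rng.choices(range(num), weights=prior)[0]
    drawn = [g_prev]
    # Step 2: update S_i with (D.1), then draw g_i from the Gibbs distribution.
    for i in range(1, n + 1):
        row, lam_i, lam_next = losses[i - 1], lambdas[i - 1], lambdas[i]
        S = [S[g] + row[g] + delta(lam_i, row, g, g_prev) for g in range(num)]
        shift = min(S)  # avoids underflow in the exponentials
        weights = [prior[g] * math.exp(-lam_next * (S[g] - shift)) for g in range(num)]
        g_prev = rng.choices(range(num), weights=weights)[0]
        drawn.append(g_prev)
    # Step 3: output a function drawn uniformly from {g_0, ..., g_n}.
    return rng.choice(drawn), drawn


# Illustrative usage (all inputs are made up for this example).
rng = random.Random(0)
n = 20
losses = [[0.0, 1.0, 1.0] for _ in range(n)]
lambdas = [1.0 / math.sqrt(i) for i in range(1, n + 2)]  # nonincreasing, positive
quad_delta = lambda lam, row, g, g_prev: 0.5 * lam * (row[g] - row[g_prev]) ** 2
output, drawn = adaptive_seqrand(losses, [1 / 3] * 3, lambdas, quad_delta, rng)
print(output, drawn)
```

A data-dependent temperature schedule would replace the fixed list `lambdas` with a rule computing $\lambda_{i+1}$ from $Z_1, \dots, Z_i$ inside the loop.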
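Proof. Let $\mathcal{E}$ denote the expected risk of the adaptive SeqRand
algorithm:

$$\mathcal{E} \triangleq \mathbb{E}_{Z_1^n} \mathbb{E}_{g\sim\mu_a} R(g) = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^i} \mathbb{E}_{\hat g_0^{i-1}} \mathbb{E}_{g'\sim\hat\rho_i} R(g').$$

We recall that $Z_{n+1}$ is a random variable independent of the training
set $Z_1^n$ and with the same distribution $P$. Define $S_{n+1}$ by (D.1) for
$i = n+1$. To shorten formulae, let $\pi_i = \pi_{-\lambda_{i+1} S_i}$ so that
by definition we have $\hat\rho_i = \hat\pi(\pi_i)$. The variance inequality
implies that

$$\mathbb{E}_{g'\sim\hat\pi(\rho)} R(g') \le -\frac{1}{\lambda} \mathbb{E}_{g'\sim\hat\pi(\rho)} \mathbb{E}_{Z\sim P} \log \mathbb{E}_{g\sim\rho}\, e^{-\lambda[L(Z,g) + \delta_\lambda(Z,g,g')]}.$$

So for any $i \in \{0, \dots, n\}$, for fixed
$\hat g_0^{i-1} = (\hat g_0, \dots, \hat g_{i-1})$ and fixed $Z_1^i$, we have

$$\mathbb{E}_{g'\sim\hat\rho_i} R(g') \le -\frac{1}{\lambda_{i+1}} \mathbb{E}_{Z_{i+1}} \mathbb{E}_{g'\sim\hat\rho_i} \log \mathbb{E}_{g\sim\pi_i}\, e^{-\lambda_{i+1}[L(Z_{i+1},g) + \delta_{\lambda_{i+1}}(Z_{i+1},g,g')]}.$$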

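Taking the expectations w.r.t. $(Z_1^i, \hat g_0^{i-1})$, we get

$$\mathbb{E}_{Z_1^i} \mathbb{E}_{\hat g_0^{i}} R(\hat g_i) = \mathbb{E}_{Z_1^i} \mathbb{E}_{\hat g_0^{i-1}} \mathbb{E}_{g'\sim\hat\rho_i} R(g') \le -\mathbb{E}_{Z_1^{i+1}} \mathbb{E}_{\hat g_0^{i}} \frac{1}{\lambda_{i+1}} \log \mathbb{E}_{g\sim\pi_i}\, e^{-\lambda_{i+1}[L(Z_{i+1},g) + \delta_{\lambda_{i+1}}(Z_{i+1},g,\hat g_i)]}.$$

Consequently, by the chain rule (i.e. cancellation in the sum of
logarithmic terms; [11]) and by intensive use of Fubini's theorem, we get

$$\mathcal{E} \le \frac{1}{n+1} \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat g_0^{n}} \sum_{i=0}^{n} a_i \tag{D.3}$$

with $a_i = -\frac{1}{\lambda_{i+1}} \log \mathbb{E}_{g\sim\pi_i}\, e^{-\lambda_{i+1}[L(Z_{i+1},g) + \delta_{\lambda_{i+1}}(Z_{i+1},g,\hat g_i)]}$.
Introduce the function
$\phi_i(\lambda) = \frac{1}{\lambda} \log \mathbb{E}_{g\sim\pi}\, e^{-\lambda S_i(g)}$.
Let us now concentrate on the last sum.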

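\begin{align*}
\sum_{i=0}^{n} a_i
 &= -\sum_{i=0}^{n} \frac{1}{\lambda_{i+1}} \log \frac{\mathbb{E}_{g\sim\pi}\, e^{-\lambda_{i+1} S_{i+1}(g)}}{\mathbb{E}_{g\sim\pi}\, e^{-\lambda_{i+1} S_i(g)}} \\
 &= -\frac{1}{\lambda_{n+1}} \log \mathbb{E}_{g\sim\pi}\, e^{-\lambda_{n+1} S_{n+1}(g)} + \sum_{i=1}^{n} \big[\phi_i(\lambda_{i+1}) - \phi_i(\lambda_i)\big] \\
 &\le -\frac{1}{\lambda_{n+1}} \log \mathbb{E}_{g\sim\pi}\, e^{-\lambda_{n+1} S_{n+1}(g)} \tag{D.4}
\end{align*}

where the second equality uses $\phi_0 \equiv 0$ (since $S_0 = 0$) and the
last inequality uses that $\lambda_{i+1} \le \lambda_i$ and that the functions
$\phi_i$ are nondecreasing according to the following lemma.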

Lemma D.2. Let $\mathcal{W}$ be a real-valued measurable function defined on
the space $\mathcal{A}$ and let $\mu$ be a probability distribution on
$\mathcal{A}$. The mapping
$\lambda \mapsto \frac{1}{\lambda} \log \mathbb{E}_{a\sim\mu}\, e^{-\lambda \mathcal{W}(a)}$
is nondecreasing on the interval on which it is defined.
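Proof of Lemma D.2. It is easy to check that the function is well-defined
on an interval (possibly empty), and is smooth on the interior of this
interval. The derivative of this function is

$$-\frac{1}{\lambda^2} \log \mathbb{E}_{a\sim\mu}\, e^{-\lambda \mathcal{W}(a)} - \frac{1}{\lambda}\, \frac{\mathbb{E}_{a\sim\mu}\big[\mathcal{W}(a)\, e^{-\lambda \mathcal{W}(a)}\big]}{\mathbb{E}_{a\sim\mu}\, e^{-\lambda \mathcal{W}(a)}} = \frac{K(\mu_{-\lambda\mathcal{W}}, \mu)}{\lambda^2} \ge 0,$$

where $\mu_{-\lambda\mathcal{W}}$ denotes the distribution with density
proportional to $e^{-\lambda\mathcal{W}}$ w.r.t. $\mu$.

□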
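
The monotonicity stated in Lemma D.2 can be observed numerically when the distribution has finite support. The sketch below is only an illustration: the support values, the probabilities and the grid of $\lambda$ are made up for the example, and the function name is hypothetical. It evaluates $\frac{1}{\lambda}\log \mathbb{E}\, e^{-\lambda \mathcal{W}}$ on an increasing grid.

```python
import math


def scaled_log_laplace(lam, values, probs):
    """(1 / lam) * log E exp(-lam * W) for W taking values[k] with probability probs[k]."""
    m = min(values)  # factoring out exp(-lam * m) keeps the exponentials in [0, 1]
    s = sum(p * math.exp(-lam * (v - m)) for v, p in zip(values, probs))
    return -m + math.log(s) / lam


# Illustrative finitely supported distribution (an assumption for this example).
values = [0.0, 1.0, 3.0]
probs = [0.2, 0.5, 0.3]
grid = [0.1 * k for k in range(1, 51)]
curve = [scaled_log_laplace(lam, values, probs) for lam in grid]
print(curve[0], curve[-1])
```

On this example the curve increases from a value slightly above $-\mathbb{E}\mathcal{W}$ towards $-\min \mathcal{W}$ as $\lambda$ grows.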

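Plugging the inequality (D.4) into (D.3) and using Lemma 3.2, we
obtain

\begin{align*}
\mathcal{E}
 &\le -\frac{1}{n+1} \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat g_0^{n}} \frac{1}{\lambda_{n+1}} \log \mathbb{E}_{g\sim\pi}\, e^{-\lambda_{n+1} S_{n+1}(g)} \\
 &= \frac{1}{n+1} \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat g_0^{n}} \min_{\rho} \Big\{ \mathbb{E}_{g\sim\rho} \sum_{i=1}^{n+1} \big[ L(Z_i, g) + \delta_{\lambda_i}(Z_i, g, \hat g_{i-1}) \big] + \frac{K(\rho,\pi)}{\lambda_{n+1}} \Big\} \\
 &\le \min_{\rho} \Big\{ \mathbb{E}_{g\sim\rho} R(g) + \mathbb{E}_{g\sim\rho} \mathbb{E}_{Z_1^n} \mathbb{E}_{\hat g_0^{n}} \frac{\sum_{i=1}^{n+1} \Delta_{\lambda_i}(g, \hat g_{i-1})}{n+1} + \mathbb{E}_{Z_1^n} \frac{K(\rho,\pi)}{\lambda_{n+1}(n+1)} \Big\}.
\end{align*}

□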